This is Part C of APAN5205 Group 6’s final report.

Data Cleaning

This part is the same as in the final report, so the code is hidden and only the output is shown.

## 'data.frame':    42656 obs. of  6 variables:
##  $ Review_ID        : int  670772142 670682799 670623270 670607911 670607296 670591897 670585330 670574142 670571027 670570869 ...
##  $ Rating           : int  4 4 4 4 4 3 5 3 2 5 ...
##  $ Year_Month       : chr  "2019-4" "2019-5" "2019-4" "2019-4" ...
##  $ Reviewer_Location: chr  "Australia" "Philippines" "United Arab Emirates" "Australia" ...
##  $ Review_Text      : chr  "If you've ever been to Disneyland anywhere you'll find Disneyland Hong Kong very similar in the layout when you"| __truncated__ "Its been a while since d last time we visit HK Disneyland .. Yet, this time we only stay in Tomorrowland .. AKA"| __truncated__ "Thanks God it wasn   t too hot or too humid when I was visiting the park   otherwise it would be a big issue (t"| __truncated__ "HK Disneyland is a great compact park. Unfortunately there is quite a bit of maintenance work going on at prese"| __truncated__ ...
##  $ Branch           : chr  "Disneyland_HongKong" "Disneyland_HongKong" "Disneyland_HongKong" "Disneyland_HongKong" ...
## Rows: 42,656
## Columns: 6
## $ Review_ID         <int> 670772142, 670682799, 670623270, 670607911, 67060729…
## $ Rating            <int> 4, 4, 4, 4, 4, 3, 5, 3, 2, 5, 5, 5, 4, 5, 5, 3, 4, 3…
## $ Year_Month        <chr> "2019-4", "2019-5", "2019-4", "2019-4", "2019-4", "2…
## $ Reviewer_Location <chr> "Australia", "Philippines", "United Arab Emirates", …
## $ Review_Text       <chr> "If you've ever been to Disneyland anywhere you'll f…
## $ Branch            <chr> "Disneyland_HongKong", "Disneyland_HongKong", "Disne…
## integer(0)
## [1] 2613    6
## [1] 40043     6
## [1] 4 3 5 2 1
##   average_rating median_rating
## 1       4.231102             5

## 
##     1     2     3     4     5 
##  1338  1929  4782 10086 21908
## 
##          1          2          3          4          5 
## 0.03341408 0.04817321 0.11942162 0.25187923 0.54711185
## 
## negative  neutral positive 
##     3267     4782    31994

## [1] 162
## 
##                       Afghanistan                           Albania 
##                                 2                                 6 
##                           Algeria                           Andorra 
##                                 2                                 1 
##               Antigua and Barbuda                         Argentina 
##                                 1                                25 
##                           Armenia                             Aruba 
##                                 1                                 2 
##                         Australia                           Austria 
##                              4412                                27 
##                        Azerbaijan                           Bahrain 
##                                 2                                39 
##                        Bangladesh                          Barbados 
##                                12                                 5 
##                           Belgium                           Bolivia 
##                               132                                 3 
##            Bosnia and Herzegovina                          Botswana 
##                                 7                                 3 
##                            Brazil                            Brunei 
##                                94                                18 
##                          Bulgaria                          Cambodia 
##                                16                                 7 
##                            Canada             Caribbean Netherlands 
##                              2116                                 1 
##                    Cayman Islands                             Chile 
##                                 1                                18 
##                             China                          Colombia 
##                               167                                11 
##                      Cook Islands                        Costa Rica 
##                                 2                                 9 
##                           Croatia                              Cuba 
##                                16                                 1 
##                           Curacao                            Cyprus 
##                                 1                                45 
##                           Czechia  Democratic Republic of the Congo 
##                                27                                 1 
##                           Denmark                Dominican Republic 
##                                82                                 4 
##                           Ecuador                             Egypt 
##                                 3                                75 
##                       El Salvador                           Estonia 
##                                 1                                 9 
##                          Ethiopia Falkland Islands (Islas Malvinas) 
##                                 3                                 2 
##                              Fiji                           Finland 
##                                 5                                60 
##                      Five Islands                            France 
##                                 1                               223 
##                  French Polynesia                           Georgia 
##                                 3                                 2 
##                           Germany                             Ghana 
##                               182                                 2 
##                         Gibraltar                            Greece 
##                                 8                               101 
##                           Grenada                              Guam 
##                                 1                                16 
##                         Guatemala                          Guernsey 
##                                 8                                 8 
##                             Haiti                          Honduras 
##                                 2                                 2 
##                         Hong Kong                           Hungary 
##                               515                                23 
##                           Iceland                             India 
##                                 5                              1470 
##                         Indonesia                              Iran 
##                               511                                26 
##                              Iraq                           Ireland 
##                                 1                               456 
##                       Isle of Man                            Israel 
##                                 8                               113 
##                             Italy                       Ivory Coast 
##                               117                                 2 
##                           Jamaica                             Japan 
##                                 2                                61 
##                            Jersey                            Jordan 
##                                14                                 8 
##                        Kazakhstan                             Kenya 
##                                 7                                16 
##                            Kuwait                              Laos 
##                                43                                 2 
##                            Latvia                           Lebanon 
##                                 5                                56 
##                             Libya                         Lithuania 
##                                 2                                 5 
##                        Luxembourg                             Macau 
##                                12                                35 
##                        Madagascar                            Malawi 
##                                 1                                 2 
##                          Malaysia                          Maldives 
##                               562                                 4 
##                              Mali                             Malta 
##                                 2                                80 
##                         Mauritius                            Mexico 
##                                27                               116 
##                           Moldova                            Monaco 
##                                 4                                 2 
##                          Mongolia                        Montenegro 
##                                 3                                 4 
##                           Morocco                        Mozambique 
##                                 4                                 3 
##                   Myanmar (Burma)                           Namibia 
##                                 7                                 1 
##                             Nepal                       Netherlands 
##                                 6                               239 
##                       New Zealand                         Nicaragua 
##                               714                                 1 
##                           Nigeria                   North Macedonia 
##                                23                                 6 
##          Northern Mariana Islands                            Norway 
##                                 2                                98 
##                              Oman                          Pakistan 
##                                23                                25 
##                            Panama                  Papua New Guinea 
##                                 6                                 1 
##                              Peru                       Philippines 
##                                12                              1024 
##                            Poland                          Portugal 
##                                25                                98 
##                       Puerto Rico                             Qatar 
##                                17                                63 
##                           Romania                            Russia 
##                                93                                43 
##                            Rwanda                      Saudi Arabia 
##                                 3                               114 
##                           Senegal                            Serbia 
##                                 1                                11 
##                        Seychelles                         Singapore 
##                                 4                               971 
##                          Slovakia                          Slovenia 
##                                 9                                 2 
##                   Solomon Islands                      South Africa 
##                                 2                               233 
##                       South Korea                       South Sudan 
##                                36                                 1 
##                             Spain                         Sri Lanka 
##                               132                                34 
##                             Sudan                          Suriname 
##                                 1                                 1 
##                            Sweden                       Switzerland 
##                                94                               117 
##                            Taiwan                          Tanzania 
##                                34                                 4 
##                          Thailand                       The Bahamas 
##                               216                                 2 
##                       Timor-Leste               Trinidad and Tobago 
##                                 1                                 7 
##                           Tunisia                            Turkey 
##                                 4                                50 
##          Turks and Caicos Islands               U.S. Virgin Islands 
##                                 1                                 5 
##                            Uganda                           Ukraine 
##                                 4                                 8 
##              United Arab Emirates                    United Kingdom 
##                               339                              9115 
##                     United States                           Uruguay 
##                             13522                                 7 
##                        Uzbekistan                           Vanuatu 
##                                 1                                 2 
##                         Venezuela                           Vietnam 
##                                 3                                55 
##                            Zambia                          Zimbabwe 
##                                 3                                 2
## Warning in countrycode_convert(sourcevar = sourcevar, origin = origin, destination = dest, : Some values were not matched unambiguously: Five Islands
## 'data.frame':    46 obs. of  28 variables:
##  $ Ride_name                : chr  "Alien Swirling Saucers" "Astro Orbiter" "Avatar Flight of Passage" "Big Thunder Mountain Railroad" ...
##  $ Park_location            : chr  "HS" "MK" "AK" "MK" ...
##  $ Park_area                : chr  "Toy Story Land" "Tomorrowland" "Pandora" "Frontierland" ...
##  $ Ride_type_all            : chr  "spinning" "spinning, slow" "thrill" "thirll, small drops" ...
##  $ Ride_type_thrill         : chr  "No" "No" "Yes" "Yes" ...
##  $ Ride_type_spinning       : chr  "Yes" "Yes" "No" "No" ...
##  $ Ride_type_slow           : chr  "No" "Yes" "No" "No" ...
##  $ Ride_type_small_drops    : chr  "No" "No" "No" "Yes" ...
##  $ Ride_type_big_drops      : chr  "No" "No" "No" "No" ...
##  $ Ride_type_dark           : chr  "No" "No" "No" "No" ...
##  $ Ride_type_scary          : chr  "No" "No" "No" "No" ...
##  $ Ride_type_water          : chr  "No" "No" "No" "No" ...
##  $ Fast_pass                : chr  "Yes" "No" "Yes" "Yes" ...
##  $ Classic                  : chr  "No" "Yes" "No" "Yes" ...
##  $ Age_interest_all         : chr  "all ages" "all ages" "kids, tweens, teens, adults" "kids, tweens, teens, adults" ...
##  $ Age_interest_preschoolers: chr  "Yes" "Yes" "No" "No" ...
##  $ Age_interest_kids        : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Age_interest_tweens      : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Age_interest_teens       : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Age_interest_adults      : chr  "Yes" "Yes" "Yes" "Yes" ...
##  $ Height_req_inches        : int  32 0 44 40 0 40 0 44 0 0 ...
##  $ Ride_duration_min        : num  1.5 1.5 5 3.5 4 3.25 1.5 2.75 5 8 ...
##  $ Open_date                : chr  "6/30/18" "2/25/95" "5/27/17" "9/23/80" ...
##  $ Age_of_ride_days         : num  1712 10238 2111 15506 8918 ...
##  $ Age_of_ride_years        : num  4.69 28.03 5.78 42.45 24.42 ...
##  $ Age_of_ride_total        : chr  "4 years 8 months 7 days" "28 years 0 months 11 days" "5 years 9 months 11 days" "42 years 5 months 14 days" ...
##  $ TL_rank                  : int  31 43 9 8 32 24 29 1 27 47 ...
##  $ TA_Stars                 : num  NA 3.5 5 4.5 4.5 4 4.5 5 4 4 ...
## [1] 4
## Warning in cor(rides_cor): the standard deviation is zero

Research Question No.3

Clean and Tokenize

This part repeats what we did in Part A, so the code is hidden here as well.

## Loading required package: NLP
## 
## Attaching package: 'NLP'
## The following object is masked from 'package:ggplot2':
## 
##     annotate
## Warning in tm_map.SimpleCorpus(corpus1, FUN = content_transformer(tolower)):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corpus1, FUN = removePunctuation):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(corpus1, FUN = removeWords,
## c(stopwords("english"))): transformation drops documents
## Warning in tm_map.SimpleCorpus(corpus1, FUN = removeNumbers): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus1, FUN = stripWhitespace): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(corpus1, FUN = stemDocument): transformation
## drops documents
## Warning in TermDocumentMatrix.SimpleCorpus(x, control): custom functions are
## ignored
## 
## Attaching package: 'tidyr'
## The following object is masked from 'package:magrittr':
## 
##     extract
## Selecting by tfidf
## Warning in brewer.pal(9, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

Rides Mentioned in Review

# Test how to extract the reviews that mention a specific ride name
library(dplyr)
library(stringr)
rides_name <- rides$Ride_name
rides_name[1]
## [1] "Astro Orbiter"
# grepl() uses only the first element of a pattern vector, so the
# alternative names are joined with "|" into one regular expression.
disneyland_1 <- disneyland %>%
  mutate(Astro_Orbiter = case_when(grepl("astro orbiter|Astro|Orbiter", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0))
ao <- disneyland_1 %>%
  filter(Astro_Orbiter == 1)

rides_name[2]
## [1] "Avatar Flight of Passage"
# Alternative names are joined with "|" because grepl() takes a single pattern.
disneyland_2 <- disneyland %>%
  mutate(Avatar_Flight = case_when(grepl("Avatar Flight of Passage|Avatar Flight|Avatar Ride", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0))
af <- disneyland_2 %>%
  filter(Avatar_Flight == 1)

rides_name[3]
## [1] "Big Thunder Mountain Railroad"
# Alternative names are joined with "|" because grepl() takes a single pattern.
disneyland_3 <- disneyland %>%
  mutate(Big_Thunder = case_when(grepl("Big Thunder Mountain Railroad|Big Thunder", Review_Text, ignore.case = TRUE) ~ 1,
                                 TRUE ~ 0))
bt <- disneyland_3 %>%
  filter(Big_Thunder == 1)

We want to detect whether a review mentions a specific ride by name. For example, if a review mentions “Astro Orbiter” in Review_Text, we flag it in a new column called Astro_Orbiter: 1 if the ride is mentioned, 0 otherwise.
We do the same for all 42 rides in the “rides” dataset, so that we can match each ride's features with the reviews that mention it.
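The flagging step described above can also be sketched in a data-driven way. This is only an illustration under stated assumptions: `ride_aliases`, `reviews`, and `flag_rides()` are hypothetical names, and the four reviews are invented stand-ins for the disneyland data. The one fact it relies on is that `grepl()` accepts a single pattern, so the alternative names of a ride are joined with `|` before matching.

```r
# Hypothetical alias table: new column name -> alternative spellings of the ride.
ride_aliases <- list(
  Astro_Orbiter = c("Astro Orbiter"),
  Big_Thunder   = c("Big Thunder Mountain Railroad", "Big Thunder"),
  Small_World   = c("It's a Small World", "Small World")
)

# Invented toy reviews standing in for the disneyland data frame.
reviews <- data.frame(
  Review_ID   = 1:4,
  Review_Text = c("We loved Big Thunder and the astro orbiter.",
                  "It's a small world was closed.",
                  "Great food, no rides mentioned.",
                  "BIG THUNDER MOUNTAIN RAILROAD twice!"),
  stringsAsFactors = FALSE
)

# Add one 0/1 column per ride to the data frame.
flag_rides <- function(df, aliases, text_col = "Review_Text") {
  for (ride in names(aliases)) {
    # Join the aliases into one regular expression: "name A|name B|...".
    pattern <- paste(aliases[[ride]], collapse = "|")
    df[[ride]] <- as.integer(grepl(pattern, df[[text_col]], ignore.case = TRUE))
  }
  df
}

flagged <- flag_rides(reviews, ride_aliases)
flagged[, names(ride_aliases)]
```

Keeping the names in one list means a ride can be added or an alias changed without touching the matching code. Aliases that contain regular-expression metacharacters (for example parentheses or periods) would need escaping first; none of the names used here do.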

library(dplyr)
library(stringr)

rides_name[1:14]
##  [1] "Astro Orbiter"                                 
##  [2] "Avatar Flight of Passage"                      
##  [3] "Big Thunder Mountain Railroad"                 
##  [4] "Buzz Lightyear's Space Ranger Spin"            
##  [5] "Dinosaur"                                      
##  [6] "Dumbo the Flying Elephant"                     
##  [7] "Expedition Everest"                            
##  [8] "Frozen Ever After"                             
##  [9] "Gran Fiesta Tour Starring The Three Caballeros"
## [10] "Haunted Mansion"                               
## [11] "It's a Small World"                            
## [12] "Journey Into Imagination with Figment"         
## [13] "Jungle Cruise"                                 
## [14] "Kali River Rapids"
# grepl() uses only the first element of a pattern vector, so each ride's
# alternative names are joined with "|" into a single regular expression.
disneyland_ride <- disneyland %>%
  mutate(Astro_Orbiter = case_when(grepl("astro orbiter|Astro|Orbiter", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Avatar_Flight = case_when(grepl("Avatar Flight of Passage|Avatar Flight|Avatar Ride", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Big_Thunder = case_when(grepl("Big Thunder Mountain Railroad|Big Thunder", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Buzz_Lightyear = case_when(grepl("Buzz Lightyear's Space Ranger Spin|Buzz Lightyear's|Space Ranger Spin|Space Ranger", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Dinosaur = case_when(grepl("Dinosaur", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Dumbo = case_when(grepl("Dumbo the Flying Elephant|Flying Elephant", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Expedition_Everest = case_when(grepl("Expedition Everest", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Frozen_Ever = case_when(grepl("Frozen Ever After|Ever After", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Gran_Fiesta = case_when(grepl("Gran Fiesta Tour Starring The Three Caballeros|Gran Fiesta|Starring The Three Caballeros|Three Caballeros", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Haunted_Mansion = case_when(grepl("Haunted Mansion", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Small_World = case_when(grepl("It's a Small World|Small World", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Journey_Into = case_when(grepl("Journey Into Imagination with Figment|Imagination with Figment", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Jungle_Cruise = case_when(grepl("Jungle Cruise", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Kali_River = case_when(grepl("Kali River Rapids|Kali River", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0))
#head(disneyland_ride)
rides_name[15:28]
##  [1] "Kilimanjaro Safaris"             "Living with the Land"           
##  [3] "Mad Tea Party"                   "Mission Space"                  
##  [5] "Na'vi River Journey"             "Peter Pan's Flight"             
##  [7] "Pirates of the Caribbean"        "Primeval Whirl"                 
##  [9] "Prince Charming Regal Carrousel" "Rock 'n' Roller Coaster"        
## [11] "Seven Dwarfs Mine Train"         "Soarin' Around the World"       
## [13] "Space Mountain"                  "Spaceship Earth"
# Alternative names are joined with "|" because grepl() takes a single pattern.
disneyland_ride <- disneyland_ride %>%
  mutate(Kilimanjaro_Safaris = case_when(grepl("Kilimanjaro Safaris|Kilimanjaro", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Living_With = case_when(grepl("Living with the Land", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Mad_Tea = case_when(grepl("Mad Tea Party|Mad Tea", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Mission_Space = case_when(grepl("Mission Space", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Navi_River = case_when(grepl("Na'vi River Journey|Na'vi", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Peter_Pan = case_when(grepl("Peter Pan's Flight|Peter Pan's", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Pirates = case_when(grepl("Pirates of the Caribbean|Pirates|Caribbean", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Primeval_Whirl = case_when(grepl("Primeval Whirl", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Prince_Charming = case_when(grepl("Prince Charming Regal Carrousel|Regal Carrousel", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Rock_Roller = case_when(grepl("Rock 'n' Roller Coaster|Rock n", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Seven_Dwarfs = case_when(grepl("Seven Dwarfs Mine Train|Mine Train", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Soarin_Around = case_when(grepl("Soarin' Around the World|Soaring Around|Soarin' Around", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Space_Mountain = case_when(grepl("Space Mountain", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Spaceship_Earthn = case_when(grepl("Spaceship Earth", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0))
rides_name[29:42]
##  [1] "Splash Mountain"                           
##  [2] "Star Tours"                                
##  [3] "Test Track"                                
##  [4] "The Barnstormer"                           
##  [5] "The Magic Carpets of Aladdin"              
##  [6] "The Many Adventures of Winnie the Pooh"    
##  [7] "The Twilight Zone Tower of Terror"         
##  [8] "Tomorrowland Speedway"                     
##  [9] "Tomorrowland Transit Authority PeopleMover"
## [10] "Toy Story Mania"                           
## [11] "TriceraTop Spin"                           
## [12] "Under the Sea"                             
## [13] "Walt Disney World Railroad"                
## [14] "Walt Disney's Carousel of Progress"
# Alternative names are joined with "|" because grepl() takes a single pattern.
disneyland_ride <- disneyland_ride %>%
  mutate(Splash_Mountain = case_when(grepl("Splash Mountain", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Star_Tours = case_when(grepl("Star Tours", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Test_Track = case_when(grepl("Test Track", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Barnstormer = case_when(grepl("The Barnstormer", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Magic_Carpets = case_when(grepl("The Magic Carpets of Aladdin|Magic Carpets", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Winnie_Pooh = case_when(grepl("The Many Adventures of Winnie the Pooh|Many Adventures of Winnie", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Twilight_Zone = case_when(grepl("The Twilight Zone Tower of Terror|Twilight Zone|Tower of Terror", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Tomorrowland_Speedway = case_when(grepl("Tomorrowland Speedway", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Tomorrowland_Transit = case_when(grepl("Tomorrowland Transit Authority PeopleMover|Transit Authority", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Toy_Story = case_when(grepl("Toy Story Mania", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(TriceraTop_Spin = case_when(grepl("TriceraTop Spin|Tricera Top", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Under_Sea = case_when(grepl("Under the Sea", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(World_Railroad = case_when(grepl("Walt Disney World Railroad|Disney World Railroad", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0)) %>%
  mutate(Carousel_Progress = case_when(grepl("Walt Disney's Carousel of Progress|Carousel of Progress", Review_Text, ignore.case = TRUE) ~ 1,
                                   TRUE ~ 0))
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `TriceraTop_Spin = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `World_Railroad = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used
## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `Carousel_Progress = case_when(...)`.
## Caused by warning in `grepl()`:
## ! argument 'pattern' has length > 1 and only the first element will be used

We matched ride names against Review_Text manually because we want to capture the different ways a ride may be named in a review. For example, for the ride “Seven Dwarfs Mine Train” we searched for both the official name and the likely abbreviation “Mine Train”, ignoring letter case. However, we cannot guarantee that we captured every occurrence: a typo in a review, or a nickname we did not anticipate, will not be matched.
Some ride names include a Disney character, for example “Peter Pan’s Flight”, for which we also searched for the abbreviation “Peter Pan’s”. In such cases we cannot be sure whether a review refers to the ride or to the character who interacts with visitors in the park.
Future NLP analysis that captures the context of a review may be able to tell whether “Peter Pan” refers to the character or to the ride. For now, we use the manually matched dataset.
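
The matching of several name variants can be sketched on toy data as follows. This is a minimal illustration, not the code used for the report: the helper name mentions_any and the three sample reviews are made up. It relies on the fact that grepl() only uses the first element of a pattern vector, so the variants are collapsed into one regular expression first.

```r
# Hypothetical helper (not part of the report's code): flag texts that
# mention any of several name variants, ignoring letter case.
mentions_any <- function(text, variants) {
  # grepl() uses only the first element of a pattern vector, so the
  # variants are collapsed into a single alternation such as "a|b|c".
  pattern <- paste(variants, collapse = "|")
  as.integer(grepl(pattern, text, ignore.case = TRUE))
}

# Toy reviews, invented for illustration
reviews <- c(
  "We loved the Tower of Terror!",
  "The twilight zone ride was closed.",
  "Great fireworks, long queues."
)

twilight_flags <- mentions_any(
  reviews,
  c("The Twilight Zone Tower of Terror", "Twilight Zone", "Tower of Terror")
)
print(twilight_flags)
```

Variants that contain regular-expression metacharacters (for example parentheses or a question mark) would need to be escaped before being pasted together.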

Because some of the rides exist only at the Orlando Disney park, some ride columns are never matched by any Review_Text. We exclude the columns for rides that are not mentioned in any review.

#keep only the columns that contain at least one non-zero value (drops ride columns with no mentions)
disneyland_ride2 <- disneyland_ride %>% 
  select_if(function(x) !all(x == 0)) 
#head(disneyland_ride2)

# Plot the number of times each ride is mentioned in reviews:
rides_sum <- disneyland_ride2 %>% 
  select(11:34) %>% 
  colSums()

rides_sum <- rides_sum[order(-rides_sum)]
barplot(rides_sum, main = "Number of Times Each Ride Being Mentioned", xlab = "Ride Names", ylab = "Count", col = "lightblue", density = 30, las = 2, cex.names = 0.6, ylim = c(0, 3500))

From the above plot, “Space Mountain”, “Pirates”, “Haunted Mansion”, “Star Tours”, and “Splash Mountain” are the five most frequently mentioned rides.

Analyze the Ride Space Mountain (most mentioned in reviews):

With so many rides, we first analyze “Space Mountain”, the ride mentioned most often in reviews.

#keep only reviews that mention Space Mountain
spaceMountain <- disneyland_ride2 %>%
  filter(Space_Mountain == 1) 

library(ggplot2)
plot1 <- ggplot(spaceMountain, aes(Rating)) +
         geom_bar(stat="count", position = "dodge") +
  ggtitle('Rating Distribution for Reviews Mentioned Ride Space Mountain')
plot1

#keep only negative reviews that mention Space Mountain
spaceMountain_Neg <- spaceMountain %>%
  filter(Rating_type == "negative") 

plot2 <- ggplot(spaceMountain_Neg, aes(Rating)) +
         geom_bar(stat="count", position = "dodge") +
  ggtitle('Rating Distribution for Negative Reviews Mentioned Ride Space Mountain')
plot2

#for all Space Mountain reviews (positive and negative), extract the first sentence containing "Space Mountain", regardless of letter case
spaceMountain <- spaceMountain %>%
  mutate(Ride_Sentence = str_extract(Review_Text, "(?i)\\b[^.]*Space Mountain[^.]*\\b"))
#spaceMountain

In the “Rating Distribution for Reviews Mentioned Ride Space Mountain” plot, five-score ratings are the most common among reviews mentioning “Space Mountain”. Overall, visitors who mention Space Mountain rate their visit positively.
We extracted the sentence in Review_Text that mentions the Space Mountain ride and stored it in the Ride_Sentence column.
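
To make the extraction step concrete, here is a small base-R sketch of the same regular expression applied to toy text. The helper name extract_ride_sentence and the sample reviews are assumptions made for illustration, and regexpr() with perl = TRUE stands in for str_extract(). Like str_extract(), it returns only the first match per review and gives NA when the ride is not mentioned.

```r
# Hypothetical helper: return the first period-delimited sentence that
# mentions `ride` (case-insensitive), or NA if the ride is not mentioned.
extract_ride_sentence <- function(text, ride) {
  # Same shape as the pattern used above: no periods are allowed on either
  # side of the ride name, so the match stays inside one sentence.
  pattern <- paste0("(?i)\\b[^.]*", ride, "[^.]*\\b")
  m <- regexpr(pattern, text, perl = TRUE)
  hit <- as.integer(m) != -1L
  out <- rep(NA_character_, length(text))
  out[hit] <- regmatches(text, m)  # regmatches() returns matched texts only
  out
}

# Toy reviews, invented for illustration
reviews <- c(
  "Long queues. We rode space mountain twice and loved it. Food was pricey.",
  "Nice parade."
)
sentences <- extract_ride_sentence(reviews, "Space Mountain")
print(sentences)
```

Because the pattern splits on periods only, sentences ending in "!" or "?" are merged with their neighbours, and a period inside an abbreviation cuts a sentence short.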

We use the binary sentiment lexicon “bing” to categorize the words in Ride_Sentence as positive or negative.

library(tidytext)
library(ggthemes) #provides theme_economist() and theme_wsj() used in the plots below
as.data.frame(get_sentiments('bing'))[1:50,]
##               word sentiment
## 1          2-faces  negative
## 2         abnormal  negative
## 3          abolish  negative
## 4       abominable  negative
## 5       abominably  negative
## 6        abominate  negative
## 7      abomination  negative
## 8            abort  negative
## 9          aborted  negative
## 10          aborts  negative
## 11          abound  positive
## 12         abounds  positive
## 13          abrade  negative
## 14        abrasive  negative
## 15          abrupt  negative
## 16        abruptly  negative
## 17         abscond  negative
## 18         absence  negative
## 19   absent-minded  negative
## 20        absentee  negative
## 21          absurd  negative
## 22       absurdity  negative
## 23        absurdly  negative
## 24      absurdness  negative
## 25       abundance  positive
## 26        abundant  positive
## 27           abuse  negative
## 28          abused  negative
## 29          abuses  negative
## 30         abusive  negative
## 31         abysmal  negative
## 32       abysmally  negative
## 33           abyss  negative
## 34      accessable  positive
## 35      accessible  positive
## 36      accidental  negative
## 37         acclaim  positive
## 38       acclaimed  positive
## 39     acclamation  positive
## 40        accolade  positive
## 41       accolades  positive
## 42   accommodative  positive
## 43    accomodative  positive
## 44      accomplish  positive
## 45    accomplished  positive
## 46  accomplishment  positive
## 47 accomplishments  positive
## 48          accost  negative
## 49        accurate  positive
## 50      accurately  positive
get_sentiments('bing')%>%
  group_by(sentiment)%>%
  count()
## # A tibble: 2 × 2
## # Groups:   sentiment [2]
##   sentiment     n
##   <chr>     <int>
## 1 negative   4781
## 2 positive   2005
spaceMountain %>%
  group_by(Review_ID) %>%
  unnest_tokens(output = word, input = Ride_Sentence)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=sentiment,y=n,fill=sentiment))+
  geom_col()+
  theme_economist()+
  guides(fill = "none")+
  coord_flip() +
  labs(title = "Sentiment Analysis of All Reviews for Space Mountain") +
  theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`

#observe more positive words than negative words overall

#see if this is true in negative reviews:
spaceMountain_Neg <- spaceMountain %>%
  filter(Rating_type == "negative")

spaceMountain_Neg %>%
  group_by(Review_ID) %>%
  unnest_tokens(output = word, input = Ride_Sentence)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=sentiment,y=n,fill=sentiment))+
  geom_col()+
  theme_economist()+
  guides(fill = "none")+
  coord_flip() +
  labs(title = "Sentiment Analysis of Negative Reviews for Space Mountain") +
  theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`

#even in negatively rated reviews, the sentences mentioning 'Space Mountain' still contain more positive words than negative words.

#compare review rating with sentiment
library(ggthemes)
spaceMountain %>%
  select(Review_ID, Ride_Sentence, Rating)%>%
  group_by(Review_ID, Rating)%>%
  unnest_tokens(output=word,input= Ride_Sentence)%>%
  ungroup()%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(Rating,sentiment)%>%
  summarize(n = n())%>%
  mutate(proportion = n/sum(n))%>%
  ggplot(aes(x= Rating,y=proportion,fill=sentiment))+
  geom_col()+
  theme_economist()+
  coord_flip() +
  labs(title = "Sentiment Analysis of Reviews for Space Mountain in Different Rating Categories") +
  theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.

In the “Sentiment Analysis of All Reviews for Space Mountain” plot, positive words outnumber negative words across all reviews. In the “Sentiment Analysis of Negative Reviews for Space Mountain” plot, positive words still outnumber negative words when we keep only negative reviews (rated 1 or 2), but the gap is smaller. The “Sentiment Analysis of Reviews for Space Mountain in Different Rating Categories” plot shows a consistent pattern: every rating category contains both positive and negative words, positive reviews (4 and 5) show a higher proportion of positive words, and negative reviews (1 and 2) show a higher proportion of negative words.
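
The proportions in the last plot are shares within each rating: after summarize(), the data are still grouped by Rating, so n/sum(n) divides by the rating's own total. The base-R sketch below reproduces that calculation on toy data; the tokens data frame is invented for illustration and does not come from the reviews.

```r
# Toy token-level data, invented for illustration: one row per matched word
tokens <- data.frame(
  Rating    = c(1, 1, 1, 5, 5, 5, 5),
  sentiment = c("negative", "negative", "positive",
                "positive", "positive", "positive", "negative")
)

# Count words per rating and sentiment, then divide each row by its total
counts <- table(tokens$Rating, tokens$sentiment)
props  <- prop.table(counts, margin = 1)
print(round(props, 2))
```

Each row of props sums to 1, which is why every bar in the stacked plot has the same length regardless of how many reviews fall in that rating.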

Next, we look at the emotions expressed in reviews that mention Space Mountain.

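#load the NRC emotion lexicon: one row per word-emotion pair, with a numeric flag in num
nrc = read.table(file = 'https://raw.githubusercontent.com/pseudorational/data/master/nrc_lexicon.txt',
                 header = FALSE,
                 col.names = c('word','sentiment','num'),
                 sep = '\t',
                 stringsAsFactors = FALSE)
#keep only the word-emotion pairs whose flag is not 0, then drop the flag column
nrc = nrc[nrc$num!=0,]
nrc$num = NULL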
spaceMountain %>%
  group_by(Review_ID)%>%
  unnest_tokens(output = word, input = Ride_Sentence)%>%
  inner_join(nrc)%>%
  group_by(sentiment)%>%
  count()%>%
  arrange(desc(n))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 4 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.
## # A tibble: 10 × 2
## # Groups:   sentiment [10]
##    sentiment        n
##    <chr>        <int>
##  1 anticipation  7348
##  2 positive      4774
##  3 negative      2444
##  4 joy           2409
##  5 trust         2171
##  6 fear          1938
##  7 surprise      1417
##  8 sadness       1086
##  9 anger          593
## 10 disgust        364
spaceMountain %>%
  group_by(Review_ID)%>%
  unnest_tokens(output = word, input = Ride_Sentence)%>%
  inner_join(nrc)%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
  geom_col()+
  guides(fill = "none")+
  coord_flip()+
  theme_wsj() +
  labs(title = "NRC Analysis of ALL Reviews Mentioned Space Mountain") +
  theme(plot.title = element_text(size = 10, color = "darkgreen"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 4 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.

spaceMountain_Neg %>%
  group_by(Review_ID)%>%
  unnest_tokens(output = word, input = Ride_Sentence)%>%
  inner_join(nrc)%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
  geom_col()+
  guides(fill = "none")+
  coord_flip()+
  theme_wsj() +
  labs(title = "NRC Analysis of Negative Reviews Mentioned Space Mountain") +
  theme(plot.title = element_text(size = 10, color = "darkgreen"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 18 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.

In the “NRC Analysis of ALL Reviews Mentioned Space Mountain” plot, “anticipation” and “positive” appear more frequently than “negative” or “fear”. In the “NRC Analysis of Negative Reviews Mentioned Space Mountain” plot, “anticipation” and “positive” are still frequent, but “fear” and “sadness” move up the ranking. This suggests that feelings towards the Space Mountain ride remain mostly positive even among visitors who rated their overall Disneyland experience 1 or 2, although negative emotion words are used more often in negatively rated reviews.
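
When reading these counts, note that the join warnings above report that a single word can match several lexicon rows, so the totals count word-emotion pairs rather than words. The toy sketch below illustrates this with base R merge(); the small lexicon and its tags are made up and are not taken from the real NRC file.

```r
# Toy lexicon, invented for illustration (not the real NRC tags):
# "thrill" carries two emotions, "wait" carries one
toy_lexicon <- data.frame(
  word      = c("thrill", "thrill", "wait"),
  sentiment = c("joy", "anticipation", "anticipation")
)

# Three tokens from a hypothetical ride sentence; "ride" is not in the lexicon
tokens <- data.frame(word = c("thrill", "wait", "ride"))

# Inner join: one output row per matching word-emotion pair
matched <- merge(tokens, toy_lexicon, by = "word")
emotion_counts <- sort(table(matched$sentiment), decreasing = TRUE)
print(emotion_counts)
```

Two matched words produce three rows here, so emotion totals can exceed the number of matched words.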

Analyze the Ride Pirates (second most mentioned):

We also analyze the Pirates of the Caribbean ride, since it has the second-highest number of mentions in reviews. With more time, we would analyze every ride one by one. For now, we pick only the top two because their larger sample sizes make the results more reliable.

rides_sum
##        Space_Mountain               Pirates       Haunted_Mansion 
##                  3465                  1293                  1063 
##            Star_Tours       Splash_Mountain           Small_World 
##                   903                   877                   866 
##         Jungle_Cruise             Peter_Pan           Big_Thunder 
##                   337                   126                   118 
##              Dinosaur                 Dumbo           Winnie_Pooh 
##                    29                    28                    24 
##             Toy_Story               Mad_Tea         Twilight_Zone 
##                    24                    23                    23 
##             Under_Sea    Expedition_Everest            Test_Track 
##                    18                    14                    10 
##         Astro_Orbiter         Mission_Space           Rock_Roller 
##                     9                     5                     5 
##         Soarin_Around          Seven_Dwarfs Tomorrowland_Speedway 
##                     4                     1                     1
#keep only reviews that mention Pirates
pirates <- disneyland_ride2 %>%
  filter(Pirates == 1) 
#head(pirates)

library(ggplot2)
plot3 <- ggplot(pirates, aes(Rating)) +
         geom_bar(stat="count", position = "dodge") +
  ggtitle('Rating Distribution for Reviews Mentioned Ride pirates')
plot3

#keep only negative reviews that mention Pirates
pirates_Neg <- pirates %>%
  filter(Rating_type == "negative") 

plot4 <- ggplot(pirates_Neg, aes(Rating)) +
         geom_bar(stat="count", position = "dodge") +
  ggtitle('Rating Distribution for Negative Reviews Mentioned Ride Pirates')
plot4

#for all Pirates reviews (positive and negative), extract the first sentence containing "pirates", regardless of letter case
pirates <- pirates %>%
  mutate(Pirates_Sentence = str_extract(Review_Text, "(?i)\\b[^.]*pirates[^.]*\\b"))
#pirates

In the “Rating Distribution for Reviews Mentioned Ride pirates” plot, we again observe a rating pattern similar to that of all Disney reviews (whether or not they mention a ride) and of the Space Mountain reviews: five-score reviews are the most common and one-score reviews the least common.

pirates %>%
  group_by(Review_ID) %>%
  unnest_tokens(output = word, input = Pirates_Sentence)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=sentiment,y=n,fill=sentiment))+
  geom_col()+
  theme_economist()+
  guides(fill = "none")+
  coord_flip() +
  labs(title = "Sentiment Analysis of All Reviews for Pirates") +
  theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`

#observe more positive words than negative words overall

#see if this is true in negative reviews:
pirates_Neg <- pirates %>%
  filter(Rating_type == "negative")

pirates_Neg %>%
  group_by(Review_ID) %>%
  unnest_tokens(output = word, input = Pirates_Sentence)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=sentiment,y=n,fill=sentiment))+
  geom_col()+
  theme_economist()+
  guides(fill = "none")+
  coord_flip() +
  labs(title = "Sentiment Analysis of Negative Reviews for Pirates") +
  theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`

#even in negatively rated reviews, the sentences mentioning 'Pirates' still contain more positive words than negative words.

#compare review rating with sentiment
library(ggthemes)
pirates %>%
  select(Review_ID, Pirates_Sentence, Rating)%>%
  group_by(Review_ID, Rating)%>%
  unnest_tokens(output=word,input= Pirates_Sentence)%>%
  ungroup()%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(Rating,sentiment)%>%
  summarize(n = n())%>%
  mutate(proportion = n/sum(n))%>%
  ggplot(aes(x= Rating,y=proportion,fill=sentiment))+
  geom_col()+
  theme_economist()+
  coord_flip() +
  labs(title = "Sentiment Analysis of All Reviews for Pirates Across All Rating Categories") +
  theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.

Positive words outnumber negative words in both the “Sentiment Analysis of All Reviews for Pirates” plot and the “Sentiment Analysis of Negative Reviews for Pirates” plot. This suggests that the experience with the Pirates of the Caribbean ride is generally positive even when visitors gave a negative overall rating.
The “Sentiment Analysis of All Reviews for Pirates Across All Rating Categories” plot shows a different proportion pattern. In every rating category, positive sentiment outweighs negative sentiment. Five-score reviews still have the highest proportion of positive sentiment, but two-score reviews have a higher proportion of positive sentiment than three-score reviews.

pirates %>%
  group_by(Review_ID)%>%
  unnest_tokens(output = word, input = Pirates_Sentence)%>%
  inner_join(nrc)%>%
  group_by(sentiment)%>%
  count()%>%
  arrange(desc(n))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 12 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.
## # A tibble: 10 × 2
## # Groups:   sentiment [10]
##    sentiment        n
##    <chr>        <int>
##  1 positive      1672
##  2 anticipation  1557
##  3 negative      1009
##  4 joy            856
##  5 trust          749
##  6 fear           723
##  7 sadness        504
##  8 surprise       471
##  9 anger          194
## 10 disgust        116
pirates %>%
  group_by(Review_ID)%>%
  unnest_tokens(output = word, input = Pirates_Sentence)%>%
  inner_join(nrc)%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
  geom_col()+
  guides(fill = "none")+
  coord_flip()+
  theme_wsj() +
  labs(title = "NRC Analysis of ALL Reviews Mentioned Pirates") +
  theme(plot.title = element_text(size = 10, color = "darkgreen"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 12 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.

pirates_Neg %>%
  group_by(Review_ID)%>%
  unnest_tokens(output = word, input = Pirates_Sentence)%>%
  inner_join(nrc)%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
  geom_col()+
  guides(fill = "none")+
  coord_flip()+
  theme_wsj() +
  labs(title = "NRC Analysis of Negative Reviews Mentioned Pirates") +
  theme(plot.title = element_text(size = 10, color = "darkgreen"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 7 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.

In the “NRC Analysis of ALL Reviews Mentioned Pirates” plot, “positive” and “anticipation” are at the top of the ranking, followed by “negative”. The “NRC Analysis of Negative Reviews Mentioned Pirates” plot shows a similar ranking.

Adding Ride Features

We want to combine the ride features in the “rides” dataset with our “disneyland_ride2” dataset so we can explore whether ride features are related to rating. We keep only reviews that mention exactly one ride: if a review mentions more than one ride, we cannot tell which ride's features to link to its rating.

#head(disneyland_ride2,200)

#keep only reviews that mention at least one ride
disneyland_ride3 <- disneyland_ride2 %>%
  filter(rowSums(disneyland_ride2[, 11:34]) != 0)

#keep only reviews that mention exactly one ride
disneyland_ride4 <- disneyland_ride3[rowSums(disneyland_ride3[, 11:34] == 1) == 1, ]
#head(disneyland_ride4)

#create a new column holding the name of the ride mentioned in the review
disneyland_ride4 <- disneyland_ride4 %>%
  mutate(Ride_Mentioned = case_when(
    `Astro_Orbiter` == 1 ~ "Astro Orbiter",
    `Big_Thunder` == 1 ~ "Big Thunder Mountain Railroad",
    `Dinosaur` == 1 ~ "Dinosaur",
    `Dumbo` == 1 ~ "Dumbo the Flying Elephant",
    `Expedition_Everest` == 1 ~ "Expedition Everest",
    `Haunted_Mansion` == 1 ~ "Haunted Mansion",
    `Small_World` == 1 ~ "It's a Small World",
    `Jungle_Cruise` == 1 ~ "Jungle Cruise",
    `Mad_Tea` == 1 ~ "Mad Tea Party",
    `Mission_Space` == 1 ~ "Mission Space",
    `Peter_Pan` == 1 ~ "Peter Pan's Flight",
    `Pirates` == 1 ~ "Pirates of the Caribbean",
    `Rock_Roller` == 1 ~ "Rock 'n' Roller Coaster",
    `Seven_Dwarfs` == 1 ~ "Seven Dwarfs Mine Train",
    `Soarin_Around` == 1 ~ "Soarin' Around the World",
    `Space_Mountain` == 1 ~ "Space Mountain",
    `Splash_Mountain` == 1 ~ "Splash Mountain",
    `Star_Tours` == 1 ~ "Star Tours",
    `Test_Track` == 1 ~ "Test Track",
    `Winnie_Pooh` == 1 ~ "The Many Adventures of Winnie the Pooh",
    `Twilight_Zone` == 1 ~ "The Twilight Zone Tower of Terror",
    `Tomorrowland_Speedway` == 1 ~ "Tomorrowland Speedway",
    `Toy_Story` == 1 ~ "Toy Story Mania",
    `Under_Sea` == 1 ~ "Under the Sea",
    TRUE ~ NA_character_
  ))

#drop irrelevant columns:
disneyland_ride4 <- disneyland_ride4 %>% 
  select(-c(11:34))
#head(disneyland_ride4,100)

#combine the two datasets
#head(rides)
rides_new <- rides[, -c(2, 3)]

disneyland_ridefull <- disneyland_ride4 %>% 
                    left_join(rides_new, by = c("Ride_Mentioned" = "Ride_name"))
#disneyland_ridefull
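
The steps above (keep reviews with exactly one ride flag, then record which ride it is) can be sketched on toy data in base R. The flags data frame and the ride_cols vector are invented for illustration. The sketch selects the flag columns by name instead of by position, which is one way to avoid depending on the column order 11:34.

```r
# Toy indicator data, invented for illustration
flags <- data.frame(
  Review_ID      = 1:4,
  Space_Mountain = c(1, 1, 0, 0),
  Pirates        = c(0, 1, 0, 1)
)
ride_cols <- c("Space_Mountain", "Pirates")

# Number of rides mentioned per review
n_rides <- rowSums(flags[, ride_cols])

# Keep reviews that mention exactly one ride
single <- flags[n_rides == 1, ]

# The only column equal to 1 in each remaining row identifies the ride
ride_index <- max.col(as.matrix(single[, ride_cols]), ties.method = "first")
single$Ride_Mentioned <- ride_cols[ride_index]
print(single)
```

Review 2 mentions both rides and review 3 mentions none, so only reviews 1 and 4 remain.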

Rides Features and Rating

R_thrill <- disneyland_ridefull %>%
  filter(Ride_type_thrill == 1)

plot_thrill <- ggplot(R_thrill, aes(Rating)) +
         geom_bar(stat="count", position = "dodge")
plot_thrill 

ggplot(data = disneyland_ridefull, aes(Rating, fill = Ride_type_all)) +
  geom_bar(stat = "count", position = "stack") +
  labs(title = "Count of Ride Ratings by Ride Type Across All Rating Categories") +
  theme(plot.title = element_text(size = 10, color = "darkblue"))

From the above plot, the distribution of Ride_type_all is similar across rating categories. Specifically, rides that are “thrill, big drops, dark” make up the largest share in every rating category.
To look at ride features individually, we separate them and check whether each one is related to rating.

#Thrill
thrill_prop <- disneyland_ridefull %>% 
  group_by(Rating, Ride_type_thrill) %>% 
  summarize(n = n()) %>% 
  mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
ggplot(thrill_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_thrill))) + 
  geom_col(position = "stack") +
  scale_fill_manual(values = c("lightblue", "pink"), 
                    name = "If Ride is Thrill",
                    labels = c("Not Thrilling", "Thrilling")) +
  labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Rides' Thrillness") +
  theme_bw()

#head(disneyland_ridefull)
#Spinning
spin_prop <- disneyland_ridefull %>% 
  group_by(Rating, Ride_type_spinning) %>% 
  summarize(n = n()) %>% 
  mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
spin_prop
## # A tibble: 10 × 4
## # Groups:   Rating [5]
##    Rating Ride_type_spinning     n    prop
##     <int>              <dbl> <int>   <dbl>
##  1      1                  0   111 0.991  
##  2      1                  1     1 0.00893
##  3      2                  0   203 0.995  
##  4      2                  1     1 0.00490
##  5      3                  0   547 0.995  
##  6      3                  1     3 0.00545
##  7      4                  0  1142 0.996  
##  8      4                  1     5 0.00436
##  9      5                  0  1956 0.991  
## 10      5                  1    17 0.00862
ggplot(spin_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_spinning))) + 
  geom_col(position = "stack") +
  scale_fill_manual(values = c("lightblue", "pink"), 
                    name = "If Ride is Spinning",
                    labels = c("Not Spinning", "Spinning")) +
  labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Rides' Spinningness") +
  theme_bw()

#head(disneyland_ridefull)
#Slow
slow_prop <- disneyland_ridefull %>% 
  group_by(Rating, Ride_type_slow) %>% 
  summarize(n = n()) %>% 
  mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
slow_prop
## # A tibble: 10 × 4
## # Groups:   Rating [5]
##    Rating Ride_type_slow     n  prop
##     <int>          <dbl> <int> <dbl>
##  1      1              0    73 0.652
##  2      1              1    39 0.348
##  3      2              0   136 0.667
##  4      2              1    68 0.333
##  5      3              0   366 0.665
##  6      3              1   184 0.335
##  7      4              0   773 0.674
##  8      4              1   374 0.326
##  9      5              0  1292 0.655
## 10      5              1   681 0.345
ggplot(slow_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_slow))) + 
  geom_col(position = "stack") +
  scale_fill_manual(values = c("lightblue", "pink"), 
                    name = "Speed of Ride",
                    labels = c("Quick", "Slow")) +
  labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Rides' Speed") +
  theme_bw()
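
A possible follow-up, which is not part of the analysis above, is to check more formally whether the share of slow rides differs across ratings. The sketch below runs a chi-squared test of independence on the counts printed in slow_prop; choosing chisq.test() here is our assumption, not a method used elsewhere in this report.

```r
# Counts copied from the slow_prop output above
# (rows: Rating 1 to 5; columns: Ride_type_slow 0 and 1)
slow_counts <- matrix(
  c(  73,  39,
     136,  68,
     366, 184,
     773, 374,
    1292, 681),
  ncol = 2, byrow = TRUE,
  dimnames = list(Rating = as.character(1:5),
                  Ride_type_slow = c("0", "1"))
)

# Chi-squared test of independence between rating and the slow-ride flag
test_slow <- chisq.test(slow_counts)
print(test_slow)
```

A large p-value would be consistent with the similar proportions seen in the table, while a small one would suggest the share of slow rides varies with rating.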

#head(disneyland_ridefull)
#check if Small Drops and Big Drops complement each other
all(disneyland_ridefull$Ride_type_small_drops == !disneyland_ridefull$Ride_type_big_drops)
## [1] FALSE
#They do not complement each other -> there may be rides with no drops at all

disneyland_ridefull <- disneyland_ridefull %>%
  mutate(Ride_type_drop = if_else(Ride_type_small_drops == 1, 1,
                                  if_else(Ride_type_big_drops == 1, 2, 0)))

drop_prop <- disneyland_ridefull %>% 
  group_by(Rating, Ride_type_drop) %>% 
  summarize(n = n()) %>% 
  mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
drop_prop
## # A tibble: 15 × 4
## # Groups:   Rating [5]
##    Rating Ride_type_drop     n  prop
##     <int>          <dbl> <int> <dbl>
##  1      1              0    20 0.179
##  2      1              1    33 0.295
##  3      1              2    59 0.527
##  4      2              0    40 0.196
##  5      2              1    45 0.221
##  6      2              2   119 0.583
##  7      3              0   108 0.196
##  8      3              1   119 0.216
##  9      3              2   323 0.587
## 10      4              0   238 0.207
## 11      4              1   220 0.192
## 12      4              2   689 0.601
## 13      5              0   459 0.233
## 14      5              1   388 0.197
## 15      5              2  1126 0.571
ggplot(drop_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_drop))) + 
  geom_col(position = "stack") +
  scale_fill_manual(values = c("lightblue", "pink", "mediumpurple"), 
                    name = "If Ride has Drops",
                    labels = c("No Drops", "Small Drops", "Big Drops")) +
  labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Rides' Drop Type") +
  theme_bw()

#head(disneyland_ridefull)
#Dark
dark_prop <- disneyland_ridefull %>% 
  group_by(Rating, Ride_type_dark) %>% 
  summarize(n = n()) %>% 
  mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
dark_prop
## # A tibble: 10 × 4
## # Groups:   Rating [5]
##    Rating Ride_type_dark     n  prop
##     <int>          <dbl> <int> <dbl>
##  1      1              0    34 0.304
##  2      1              1    78 0.696
##  3      2              0    53 0.260
##  4      2              1   151 0.740
##  5      3              0   148 0.269
##  6      3              1   402 0.731
##  7      4              0   325 0.283
##  8      4              1   822 0.717
##  9      5              0   591 0.300
## 10      5              1  1382 0.700
ggplot(dark_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_dark))) + 
  geom_col(position = "stack") +
  scale_fill_manual(values = c("lightblue", "pink"), 
                    name = "If Ride is Dark",
                    labels = c("Not Dark", "Dark")) +
  labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Rides' Darkness") +
  theme_bw()

#head(disneyland_ridefull)
#scary
scary_prop <- disneyland_ridefull %>% 
  group_by(Rating, Ride_type_scary) %>% 
  summarize(n = n()) %>% 
  mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
scary_prop
## # A tibble: 10 × 4
## # Groups:   Rating [5]
##    Rating Ride_type_scary     n    prop
##     <int>           <dbl> <int>   <dbl>
##  1      1               0   111 0.991  
##  2      1               1     1 0.00893
##  3      2               0   200 0.980  
##  4      2               1     4 0.0196 
##  5      3               0   548 0.996  
##  6      3               1     2 0.00364
##  7      4               0  1142 0.996  
##  8      4               1     5 0.00436
##  9      5               0  1964 0.995  
## 10      5               1     9 0.00456
ggplot(scary_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_scary))) + 
  geom_col(position = "stack") +
  scale_fill_manual(values = c("lightblue", "pink"), 
                    name = "If Ride is Scary",
                    labels = c("Not Scary", "Scary")) +
  labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Rides' Scariness") +
  theme_bw()

#head(disneyland_ridefull)
#water
water_prop <- disneyland_ridefull %>% 
  group_by(Rating, Ride_type_water) %>% 
  summarize(n = n()) %>% 
  mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
water_prop
## # A tibble: 10 × 4
## # Groups:   Rating [5]
##    Rating Ride_type_water     n   prop
##     <int>           <dbl> <int>  <dbl>
##  1      1               0   103 0.920 
##  2      1               1     9 0.0804
##  3      2               0   190 0.931 
##  4      2               1    14 0.0686
##  5      3               0   509 0.925 
##  6      3               1    41 0.0745
##  7      4               0  1071 0.934 
##  8      4               1    76 0.0663
##  9      5               0  1811 0.918 
## 10      5               1   162 0.0821
ggplot(water_prop, aes(x = Rating, y = prop, fill = factor(Ride_type_water))) + 
  geom_col(position = "stack") +
  scale_fill_manual(values = c("lightblue", "pink"), 
                    name = "If Ride has Water",
                    labels = c("No Water", "Water")) +
  labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Whether Rides Have Water") +
  theme_bw()

#head(disneyland_ridefull)
# whether the ride accepts Fast Pass
table(disneyland_ridefull$Fast_pass)
## 
##    0    1 
##    2 3984
# Almost all rows are for rides that accept Fast Pass (only 2 of 3,986 do not), so we skip this feature
table(disneyland_ridefull$Classic)
## 
##    0    1 
##  633 3353
#classic
classic_prop <- disneyland_ridefull %>% 
  group_by(Rating, Classic) %>% 
  summarize(n = n()) %>% 
  mutate(prop = n/sum(n))
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.
classic_prop
## # A tibble: 10 × 4
## # Groups:   Rating [5]
##    Rating Classic     n  prop
##     <int>   <dbl> <int> <dbl>
##  1      1       0    25 0.223
##  2      1       1    87 0.777
##  3      2       0    35 0.172
##  4      2       1   169 0.828
##  5      3       0    81 0.147
##  6      3       1   469 0.853
##  7      4       0   165 0.144
##  8      4       1   982 0.856
##  9      5       0   327 0.166
## 10      5       1  1646 0.834
ggplot(classic_prop, aes(x = Rating, y = prop, fill = factor(Classic))) + 
  geom_col(position = "stack") +
  scale_fill_manual(values = c("lightblue", "pink"), 
                    name = "If Ride is Classic",
                    labels = c("No", "Yes")) +
  labs(x = "Rating", y = "Proportion", title = "Proportion of Ratings based on Whether Rides Are Classic") +
  theme_bw()

In all of the plots above, the split within each rating's bar is nearly the same regardless of the ride feature, which suggests that these ride features are not associated with ratings. Take “Proportion of Ratings based on Rides’ Speed” as an example: if the share of quick rides grew as the rating increased, that would suggest quick rides earn higher ratings than slow rides. Instead, the split is about the same across all ratings, so it mostly reflects the overall mix of quick and slow rides mentioned in the reviews. Likewise, the spinning graph shows that there are more non-spinning rides in the park, so naturally more reviews mention non-spinning rides in every rating category.
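The visual comparison above can be backed by a formal test. The sketch below is not part of the original analysis: it re-enters the counts printed in the dark_prop output and runs a Pearson chi-square test of independence between rating and the dark flag. The object names dark_counts and dark_test are introduced here for illustration only.

```r
# Counts copied from the dark_prop output above (rows = Rating 1..5)
dark_counts <- matrix(
  c( 34,   78,
     53,  151,
    148,  402,
    325,  822,
    591, 1382),
  ncol = 2, byrow = TRUE,
  dimnames = list(Rating = as.character(1:5), Dark = c("0", "1"))
)

# Row-wise proportions reproduce the prop column of dark_prop
dark_share <- round(prop.table(dark_counts, margin = 1), 3)
dark_share

# Pearson chi-square test of independence: rating vs. dark flag
dark_test <- chisq.test(dark_counts)
dark_test
```

A large p-value here would be consistent with the reading of the plot, namely that the dark/not-dark split does not vary with rating. The same pattern could be applied to the other ride-type tables.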

ggplot(data = disneyland_ridefull, aes(Rating, fill = Age_interest_all)) +
  geom_bar(stat = "count", position = "stack") +
  labs(title = "Count of Ride Ratings by Ride Age Interest Group Across All Rating Categories") +
  theme(plot.title = element_text(size = 10, color = "darkblue"))

The “kids, teens, teens, adults” group excludes only “preschoolers” from the “all ages” group. All rating categories have a similar age-group distribution.

linear_model <- lm(Rating ~ Ride_type_thrill + Ride_type_slow + Ride_type_spinning + Ride_type_drop + Ride_type_dark + Ride_type_scary + Ride_type_water + Classic +  Height_req_inches + Ride_duration_min, data = disneyland_ridefull)
summary(linear_model)
## 
## Call:
## lm(formula = Rating ~ Ride_type_thrill + Ride_type_slow + Ride_type_spinning + 
##     Ride_type_drop + Ride_type_dark + Ride_type_scary + Ride_type_water + 
##     Classic + Height_req_inches + Ride_duration_min, data = disneyland_ridefull)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2531 -0.2460 -0.0635  0.8347  1.2708 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         3.88812    0.47108   8.254  < 2e-16 ***
## Ride_type_thrill    0.11500    0.24555   0.468  0.63957    
## Ride_type_slow      0.25022    0.40990   0.610  0.54159    
## Ride_type_spinning  0.23000    0.29827   0.771  0.44068    
## Ride_type_drop     -0.19279    0.06729  -2.865  0.00419 ** 
## Ride_type_dark      0.01647    0.07547   0.218  0.82728    
## Ride_type_scary    -0.32794    0.32678  -1.004  0.31566    
## Ride_type_water     0.15925    0.28939   0.550  0.58216    
## Classic             0.07509    0.22615   0.332  0.73988    
## Height_req_inches   0.01019    0.01059   0.962  0.33589    
## Ride_duration_min   0.00310    0.02182   0.142  0.88701    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.03 on 3975 degrees of freedom
## Multiple R-squared:  0.004076,   Adjusted R-squared:  0.001571 
## F-statistic: 1.627 on 10 and 3975 DF,  p-value: 0.09263
plot(linear_model)

We built a multiple linear regression model to see whether ride features are associated with the overall Disneyland rating. Only Ride_type_drop has a small p-value (0.004), so for that feature we reject the null hypothesis of no association. Its coefficient is negative (about -0.19 rating points per unit of Ride_type_drop), so reviews mentioning rides without drops tend to carry slightly higher ratings, holding the other features constant. For the other ride features the p-values are large, so we fail to reject the null hypothesis: we find no evidence that they are associated with the overall rating. Note that the model as a whole explains almost none of the variation in ratings (adjusted R-squared of 0.0016) and its overall F-test is not significant at the 5% level (p = 0.093), so even the drop effect should be read with caution.
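To make the reading of the Ride_type_drop row concrete, the sketch below recomputes its t value, two-sided p-value, and an approximate 95% confidence interval from the rounded estimate, standard error, and residual degrees of freedom printed in the summary above. The variable names are introduced here for illustration, and small differences from the printed output are expected because the inputs are rounded.

```r
# Values copied from the summary(linear_model) output above
estimate  <- -0.19279   # Ride_type_drop coefficient
std_error <-  0.06729   # its standard error
df_resid  <-  3975      # residual degrees of freedom

# t statistic and two-sided p-value under the t distribution
t_value <- estimate / std_error
p_value <- 2 * pt(-abs(t_value), df = df_resid)

# Approximate 95% confidence interval for the coefficient
ci_95 <- estimate + c(-1, 1) * qt(0.975, df = df_resid) * std_error

round(c(t = t_value, p = p_value), 4)
round(ci_95, 3)
```

With the fitted model object available, confint(linear_model) would give the same interval without retyping the numbers.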

Staff

From the tf/tfidf graph, we see that many reviews mention “staff”. We would like to explore these reviews, and in particular the specific sentence in each review that mentions staff (as part of the description of the customer experience). We hope to learn how mentioning staff in a review relates to the overall Disneyland experience rating.
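The sentence extraction used below relies on a regular expression that treats a period as the sentence boundary. The sketch here shows what that pattern returns on two made-up reviews. It uses base R (regexpr with perl = TRUE) instead of stringr so that it runs without extra packages, and the example texts are invented for illustration.

```r
# Two made-up reviews; only the first mentions staff
reviews <- c(
  "The park was clean. The staff were friendly and helpful. Food was pricey.",
  "Great rides and short queues."
)

# Same idea as the str_extract() pattern used in the report:
# case-insensitive, run of non-period characters around the word "staff"
pattern <- "(?i)\\b[^.]*staff[^.]*\\b"

# regexpr() returns -1 where there is no match
hits <- regexpr(pattern, reviews, perl = TRUE)

# Keep NA for reviews without a match, like str_extract() does
staff_sentence <- rep(NA_character_, length(reviews))
staff_sentence[hits > 0] <- regmatches(reviews, hits)
staff_sentence
```

Two limitations follow from the pattern itself: only the first matching sentence per review is kept, and sentences ending in "!" or "?" are not split, so the extracted text can span more than one sentence.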

library(dplyr)
library(stringr)

# create a new column Staff: 1 if Review_Text contains "staff" (case-insensitive), otherwise 0
disneyland_staff <- disneyland %>%
  mutate(Staff = case_when(grepl("staff", Review_Text, ignore.case = TRUE) ~ 1,
                           TRUE ~ 0))
# keep only the rows whose Review_Text mentions staff
staff <- disneyland_staff %>%
  filter(Staff == 1)
#head(staff)
# extract the first sentence in Review_Text that mentions staff
staff <- staff %>%
  mutate(Staff_Sentence = str_extract(Review_Text, "(?i)\\b[^.]*Staff[^.]*\\b"))
plot_staff <- ggplot(staff, aes(Rating)) +
         geom_bar(stat="count", position = "dodge") +
  ggtitle('Rating Distribution for Reviews Mentioned Staff')

plot_ratingall <- ggplot(disneyland, aes(Rating)) +
         geom_bar(stat="count", position = "dodge") +
  ggtitle('Rating Distribution for All Reviews')

plot_staff

plot_ratingall

table(staff$Rating)
## 
##    1    2    3    4    5 
##  397  514  903 1313 2520
table(disneyland$Rating)
## 
##     1     2     3     4     5 
##  1338  1929  4782 10086 21908
staff_prop <- table(staff$Rating)/table(disneyland$Rating)

colors <- c("lightcoral", "tan1", "wheat", "darkseagreen2", "steelblue")
barplot(staff_prop, main = "The Proportion of Reviews Mentioned Staff", xlab = "Rating in All Reviews", ylab = "Proportion", col = colors)

The plot “Rating Distribution for Reviews Mentioned Staff” shares the same rating distribution pattern as “Rating Distribution for All Reviews”. Rating 5 has the most reviews, and the count decreases as the rating decreases.
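The raw counts look alike, but the share of reviews that mention staff within each rating tells a different story. The sketch below recomputes that share from the two tables printed above (the same ratio that staff_prop holds). The vector names are introduced here for illustration.

```r
# Counts copied from table(staff$Rating) and table(disneyland$Rating) above
staff_counts <- c("1" = 397,  "2" = 514,  "3" = 903,  "4" = 1313,  "5" = 2520)
all_counts   <- c("1" = 1338, "2" = 1929, "3" = 4782, "4" = 10086, "5" = 21908)

# Share of reviews in each rating category that mention staff
staff_share <- staff_counts / all_counts
round(staff_share, 3)
#     1     2     3     4     5
# 0.297 0.266 0.189 0.130 0.115
```

Roughly 30% of 1-star reviews mention staff, compared with about 11.5% of 5-star reviews, so staff comes up more often in low-rated reviews even though most staff mentions are in high-rated ones.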

staff %>%
  group_by(Review_ID) %>%
  unnest_tokens(output = word, input = Staff_Sentence)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=sentiment,y=n,fill=sentiment))+
  geom_col()+
  theme_economist()+
  guides(fill = "none")+
  coord_flip() +
  labs(title = "Sentiment Analysis of All Reviews for Staff") +
  theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`

# overall, we observe more positive words than negative words

# check whether this also holds in negative reviews:
staff_Neg <- staff %>%
  filter(Rating_type == "negative")

staff_Neg %>%
  group_by(Review_ID) %>%
  unnest_tokens(output = word, input = Staff_Sentence)%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=sentiment,y=n,fill=sentiment))+
  geom_col()+
  theme_economist()+
  guides(fill = "none")+
  coord_flip() +
  labs(title = "Sentiment Analysis of Negative Reviews for Staff") +
  theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`

# in negatively rated reviews, there are more negative words than positive words

# compare review rating and sentiment
library(ggthemes)
staff %>%
  select(Review_ID, Staff_Sentence, Rating)%>%
  group_by(Review_ID, Rating)%>%
  unnest_tokens(output=word,input= Staff_Sentence)%>%
  ungroup()%>%
  inner_join(get_sentiments('bing'))%>%
  group_by(Rating,sentiment)%>%
  summarize(n = n())%>%
  mutate(proportion = n/sum(n))%>%
  ggplot(aes(x= Rating,y=proportion,fill=sentiment))+
  geom_col()+
  theme_economist()+
  coord_flip() + 
  labs(title = "Sentiment Analysis of All Reviews for Staff Across All Rating Categories") +
  theme(plot.title = element_text(size = 10, color = "darkblue"))
## Joining with `by = join_by(word)`
## `summarise()` has grouped output by 'Rating'. You can override using the
## `.groups` argument.

From the graphs above, we again observe more positive than negative sentiment words in the sentences that mention staff. However, in the negatively rated reviews (with a score of 1 or 2), negative sentiment words outnumber positive ones. This suggests that negative descriptions of staff may be associated with lower ratings.
Both positive and negative sentiment words appear in every rating category, and higher ratings have a higher proportion of positive sentiment.
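The per-rating proportions in the last plot come from three steps: split each staff sentence into words, keep the words found in a sentiment lexicon, and normalise the counts within each rating. The sketch below reproduces those steps in base R on made-up data. The mini-lexicon is a hypothetical stand-in for the bing lexicon, and the sentences and ratings are invented for illustration.

```r
# Hypothetical mini-lexicon standing in for bing (illustration only)
lexicon <- data.frame(
  word      = c("friendly", "helpful", "rude", "unhelpful"),
  sentiment = c("positive", "positive", "negative", "negative")
)

# Made-up staff sentences with made-up ratings
toy <- data.frame(
  Rating = c(5, 5, 1),
  Staff_Sentence = c("The staff were friendly and helpful",
                     "Friendly staff everywhere",
                     "Staff were rude and unhelpful")
)

# Tokenise: lower-case, split on non-letters, one row per (Rating, word)
tokens <- do.call(rbind, lapply(seq_len(nrow(toy)), function(i) {
  words <- strsplit(tolower(toy$Staff_Sentence[i]), "[^a-z]+")[[1]]
  data.frame(Rating = toy$Rating[i], word = words[words != ""])
}))

# Keep only lexicon words (the inner join), then count by rating and sentiment
matched <- merge(tokens, lexicon, by = "word")
counts  <- table(matched$Rating, matched$sentiment)

# Normalise within each rating so every row sums to 1
props <- prop.table(counts, margin = 1)
props
```

Because the proportions are computed within each rating, they can be compared across ratings even though the 5-star group contains far more reviews than the 1-star group.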

staff %>%
  group_by(Review_ID)%>%
  unnest_tokens(output = word, input = Staff_Sentence)%>%
  inner_join(nrc)%>%
  group_by(sentiment)%>%
  count()%>%
  arrange(desc(n))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 3 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.
## # A tibble: 10 × 2
## # Groups:   sentiment [10]
##    sentiment        n
##    <chr>        <int>
##  1 positive     10505
##  2 joy           7263
##  3 trust         7209
##  4 anticipation  5508
##  5 negative      3074
##  6 surprise      2221
##  7 sadness       1635
##  8 fear          1353
##  9 anger         1164
## 10 disgust        965
staff %>%
  group_by(Review_ID)%>%
  unnest_tokens(output = word, input = Staff_Sentence)%>%
  inner_join(nrc)%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
  geom_col()+
  guides(fill = "none")+
  coord_flip()+
  theme_wsj() +
  labs(title = "NRC Analysis of ALL Reviews Mentioned Staff") +
  theme(plot.title = element_text(size = 10, color = "darkgreen"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 3 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.

staff_Neg %>%
  group_by(Review_ID)%>%
  unnest_tokens(output = word, input = Staff_Sentence)%>%
  inner_join(nrc)%>%
  group_by(sentiment)%>%
  count()%>%
  ggplot(aes(x=reorder(sentiment,X = n), y=n, fill=sentiment))+
  geom_col()+
  guides(fill = "none")+
  coord_flip()+
  theme_wsj() +
  labs(title = "NRC Analysis of Negative Reviews Mentioned Staff") +
  theme(plot.title = element_text(size = 10, color = "darkgreen"))
## Joining with `by = join_by(word)`
## Warning in inner_join(., nrc): Each row in `x` is expected to match at most 1 row in `y`.
## ℹ Row 19 of `x` matches multiple rows.
## ℹ If multiple matches are expected, set `multiple = "all"` to silence this
##   warning.

In the “NRC Analysis of ALL Reviews Mentioned Staff” graph, the top four categories are all positive. In the corresponding graph for negative reviews, however, “negative” moves up to rank 2. This again suggests that lower-rated reviews describe staff in more negative terms.

Predictive Models (using TF features)

# Add review_rating back to dataframe of features
disneyland_data = cbind(rating = disneyland$Rating,xdtm1)
disneyland_data_tfidf = cbind(rating = disneyland$Rating,xdtm_tfidf1)
head(disneyland_data)
##   rating busier day disneyland ever feel find hong kong main one queue ride
## 1      4      1   1          2    1    1    1    1    1    1   1     1    1
## 2      4      0   0          1    0    2    0    0    0    1   0     0    0
## 3      4      1   1          0    0    0    1    0    0    1   0     1    1
## 4      4      0   0          2    0    0    0    0    0    0   0     0    0
## 5      4      0   0          1    0    0    0    1    1    0   0     0    0
## 6      3      0   0          5    0    1    0    1    1    0   1     0    2
##   small street visit walk well world worth also area attract bit disney dont
## 1     1      1     1    1    1     1     1    0    0       0   0      0    0
## 2     0      1     1    0    0     0     0    1    1       2   1      2    1
## 3     1      0     1    0    0     0     0    0    0       2   1      0    0
## 4     0      0     1    0    0     0     0    0    1       0   1      0    0
## 5     0      0     0    0    0     0     0    0    0       0   0      0    0
## 6     2      0     0    0    1     1     0    0    0       1   0      4    1
##   especial even expect experiance good got great just last less like member
## 1        0    0      0          0    0   0     0    0    0    0    0      0
## 2        1    1      1          1    1   1     1    3    1    1    3      1
## 3        0    2      1          0    1   0     1    0    1    0    0      0
## 4        0    0      0          0    0   0     1    0    0    0    0      0
## 5        0    0      0          0    0   0     0    0    0    0    1      0
## 6        0    1      0          0    2   0     0    2    0    0    0      0
##   mountain now open park place realli seem since somethin staff star stay theme
## 1        0   0    0    0     0      0    0     0        0     0    0    0     0
## 2        1   2    1    2     1      1    1     1        1     1    1    1     2
## 3        0   0    0    2     0      3    0     0        2     0    0    0     0
## 4        0   0    0    1     0      1    0     0        0     0    0    0     0
## 5        0   0    0    0     0      1    0     0        0     0    0    0     0
## 6        0   0    0    1     0      1    0     0        0     0    0    0     0
##   time whole amazaaaaah.. around arrival big castle close enjoy everyone food
## 1    0     0            0      0       0   0      0     0     0        0    0
## 2    2     1            0      0       0   0      0     0     0        0    0
## 3    2     0            1      1       1   1      1     1     1        1    1
## 4    0     0            0      0       0   0      1     1     0        0    1
## 5    0     0            0      1       0   0      0     0     0        0    0
## 6    0     0            0      1       0   0      0     0     0        0    4
##   hour lot minut. much parad quit shop way will can crowd drink kid love pay
## 1    0   0      0    0     0    0    0   0    0   0     0     0   0    0   0
## 2    0   0      0    0     0    0    0   0    0   0     0     0   0    0   0
## 3    1   1      1    1     1    2    2   1    1   0     0     0   0    0   0
## 4    0   0      0    0     0    1    1   0    1   1     1     1   1    1   1
## 5    1   0      0    1     0    0    0   0    0   0     1     0   1    0   0
## 6    0   0      0    1     1    0    1   1    0   1     1     0   0    0   0
##   price work everythig took children expense fast however line managable never
## 1     0    0         0    0        0       0    0       0    0         0     0
## 2     0    0         0    0        0       0    0       0    0         0     0
## 3     0    0         0    0        0       0    0       0    0         0     0
## 4     1    1         0    0        0       0    0       0    0         0     0
## 5     0    0         1    1        0       0    0       0    0         0     0
## 6     0    0         0    0        1       3    1       1    1         1     1
##   peopl see show take ticket tri water bad daughter know though went best
## 1     0   0    0    0      0   0     0   0        0    0      0    0    0
## 2     0   0    0    0      0   0     0   0        0    0      0    0    0
## 3     0   0    0    0      0   0     0   0        0    0      0    0    0
## 4     0   0    0    0      0   0     0   0        0    0      0    0    0
## 5     0   0    0    0      0   0     0   0        0    0      0    0    0
## 6     2   2    1    1      1   1     3   0        0    0      0    0    0
##   disappoint little magic plan restaurant servicable. think week charactars.
## 1          0      0     0    0          0           0     0    0           0
## 2          0      0     0    0          0           0     0    0           0
## 3          0      0     0    0          0           0     0    0           0
## 4          0      0     0    0          0           0     0    0           0
## 5          0      0     0    0          0           0     0    0           0
## 6          0      0     0    0          0           0     0    0           0
##   eat enough fantastci fun get money photo train want better come holiday miss
## 1   0      0         0   0   0     0     0     0    0      0    0       0    0
## 2   0      0         0   0   0     0     0     0    0      0    0       0    0
## 3   0      0         0   0   0     0     0     0    0      0    0       0    0
## 4   0      0         0   0   0     0     0     0    0      0    0       0    0
## 5   0      0         0   0   0     0     0     0    0      0    0       0    0
## 6   0      0         0   0   0     0     0     0    0      0    0       0    0
##   must save say start two florida still mania space spend spent back earlier
## 1    0    0   0     0   0       0     0     0     0     0     0    0       0
## 2    0    0   0     0   0       0     0     0     0     0     0    0       0
## 3    0    0   0     0   0       0     0     0     0     0     0    0       0
## 4    0    0   0     0   0       0     0     0     0     0     0    0       0
## 5    0    0   0     0   0       0     0     0     0     0     0    0       0
## 6    0    0   0     0   0       0     0     0     0     0     0    0       0
##   firework night young familiar made age help need look min recommend wait pass
## 1        0     0     0        0    0   0    0    0    0   0         0    0    0
## 2        0     0     0        0    0   0    0    0    0   0         0    0    0
## 3        0     0     0        0    0   0    0    0    0   0         0    0    0
## 4        0     0     0        0    0   0    0    0    0   0         0    0    0
## 5        0     0     0        0    0   0    0    0    0   0         0    0    0
## 6        0     0     0        0    0   0    0    0    0   0         0    0    0
##   half smaller definitaley book mickey ablaze california comparable didnt first
## 1    0       0           0    0      0      0          0          0     0     0
## 2    0       0           0    0      0      0          0          0     0     0
## 3    0       0           0    0      0      0          0          0     0     0
## 4    0       0           0    0      0      0          0          0     0     0
## 5    0       0           0    0      0      0          0          0     0     0
## 6    0       0           0    0      0      0          0          0     0     0
##   make next. nice sure year adult beauties buy new old wonder land differ clean
## 1    0     0    0    0    0     0        0   0   0   0      0    0      0     0
## 2    0     0    0    0    0     0        0   0   0   0      0    0      0     0
## 3    0     0    0    0    0     0        0   0   0   0      0    0      0     0
## 4    0     0    0    0    0     0        0   0   0   0      0    0      0     0
## 5    0     0    0    0    0     0        0   0   0   0      0    0      0     0
## 6    0     0    0    0    0     0        0   0   0   0      0    0      0     0
##   high trip end found hotel light bring everithing although pirate thing use
## 1    0    0   0     0     0     0     0          0        0      0     0   0
## 2    0    0   0     0     0     0     0          0        0      0     0   0
## 3    0    0   0     0     0     0     0          0        0      0     0   0
## 4    0    0   0     0     0     0     0          0        0      0     0   0
## 5    0    0   0     0     0     0     0          0        0      0     0   0
## 6    0    0   0     0     0     0     0          0        0      0     0   0
##   full right happier without alway long friend pariah. part meet give watch
## 1    0     0       0       0     0    0      0       0    0    0    0     0
## 2    0     0       0       0     0    0      0       0    0    0    0     0
## 3    0     0       0       0     0    0      0       0    0    0    0     0
## 4    0     0       0       0     0    0      0       0    0    0    0     0
## 5    0     0       0       0     0    0      0       0    0    0    0     0
## 6    0     0       0       0     0    0      0       0    0    0    0     0
##   return least anothe adventure cant
## 1      0     0      0         0    0
## 2      0     0      0         0    0
## 3      0     0      0         0    0
## 4      0     0      0         0    0
## 5      0     0      0         0    0
## 6      0     0      0         0    0
head(disneyland_data_tfidf)
##   rating   busier      day disneyland     ever     feel     find     hong
## 1      4 3.506264 1.139515   2.261463 4.270367 3.566882 3.720356 4.112466
## 2      4 0.000000 0.000000   1.130732 0.000000 7.133764 0.000000 0.000000
## 3      4 3.506264 1.139515   0.000000 0.000000 0.000000 3.720356 0.000000
## 4      4 0.000000 0.000000   2.261463 0.000000 0.000000 0.000000 0.000000
## 5      4 0.000000 0.000000   1.130732 0.000000 0.000000 0.000000 4.112466
## 6      3 0.000000 0.000000   5.653658 0.000000 3.566882 0.000000 4.112466
##       kong     main      one   queue      ride    small   street    visit
## 1 4.124356 3.504219 1.705005 2.55117 0.8469021 2.944967 4.118086 1.685868
## 2 0.000000 3.504219 0.000000 0.00000 0.0000000 0.000000 4.118086 1.685868
## 3 0.000000 3.504219 0.000000 2.55117 0.8469021 2.944967 0.000000 1.685868
## 4 0.000000 0.000000 0.000000 0.00000 0.0000000 0.000000 0.000000 1.685868
## 5 4.124356 0.000000 0.000000 0.00000 0.0000000 0.000000 0.000000 0.000000
## 6 4.124356 0.000000 1.705005 0.00000 1.6938041 5.889933 0.000000 0.000000
##       walk     well    world    worth     also     area  attract      bit
## 1 2.931985 2.639333 2.782955 2.875106 0.000000 0.000000 0.000000 0.000000
## 2 0.000000 0.000000 0.000000 0.000000 2.584494 3.758368 5.400159 3.593599
## 3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 5.400159 3.593599
## 4 0.000000 0.000000 0.000000 0.000000 0.000000 3.758368 0.000000 3.593599
## 5 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 6 0.000000 2.639333 2.782955 0.000000 0.000000 0.000000 2.700079 0.000000
##     disney     dont especial     even  expect experiance     good     got
## 1 0.000000 0.000000 0.000000 0.000000 0.00000   0.000000 0.000000 0.00000
## 2 2.367199 2.828039 3.909884 2.289439 3.09882   2.556671 2.187123 3.05265
## 3 0.000000 0.000000 0.000000 4.578877 3.09882   0.000000 2.187123 0.00000
## 4 0.000000 0.000000 0.000000 0.000000 0.00000   0.000000 0.000000 0.00000
## 5 0.000000 0.000000 0.000000 0.000000 0.00000   0.000000 0.000000 0.00000
## 6 4.734398 2.828039 0.000000 2.289439 0.00000   0.000000 4.374245 0.00000
##      great     just     last     less     like   member mountain      now
## 1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 2 1.849821 5.965014 3.597083 4.008492 6.601919 4.190572 3.040742 7.866722
## 3 1.849821 0.000000 3.597083 0.000000 0.000000 0.000000 0.000000 0.000000
## 4 1.849821 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 5 0.000000 0.000000 0.000000 0.000000 2.200640 0.000000 0.000000 0.000000
## 6 0.000000 3.976676 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
##       open      park    place   realli     seem   since somethin    staff
## 1 0.000000 0.0000000 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000
## 2 3.228567 1.7334754 1.875502 2.336521 3.713251 4.02482 4.021891 2.874313
## 3 0.000000 1.7334754 0.000000 7.009564 0.000000 0.00000 8.043781 0.000000
## 4 0.000000 0.8667377 0.000000 2.336521 0.000000 0.00000 0.000000 0.000000
## 5 0.000000 0.0000000 0.000000 2.336521 0.000000 0.00000 0.000000 0.000000
## 6 0.000000 0.8667377 0.000000 2.336521 0.000000 0.00000 0.000000 0.000000
##       star    stay    theme     time    whole amazaaaaah..   around  arrival
## 1 0.000000 0.00000 0.000000 0.000000 0.000000     0.000000 0.000000 0.000000
## 2 4.031875 2.91912 6.960597 2.137617 3.799916     0.000000 0.000000 0.000000
## 3 0.000000 0.00000 0.000000 2.137617 0.000000     2.995216 2.714669 4.151631
## 4 0.000000 0.00000 0.000000 0.000000 0.000000     0.000000 0.000000 0.000000
## 5 0.000000 0.00000 0.000000 0.000000 0.000000     0.000000 2.714669 0.000000
## 6 0.000000 0.00000 0.000000 0.000000 0.000000     0.000000 2.714669 0.000000
##        big  castle    close    enjoy everyone     food     hour      lot
## 1 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 2 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 3 3.224856 3.88785 2.807211 2.237374 3.753987 2.064658 2.621042 2.444361
## 4 0.000000 3.88785 2.807211 0.000000 0.000000 2.064658 0.000000 0.000000
## 5 0.000000 0.00000 0.000000 0.000000 0.000000 0.000000 2.621042 0.000000
## 6 0.000000 0.00000 0.000000 0.000000 0.000000 8.258631 0.000000 0.000000
##     minut.     much    parad     quit     shop      way     will      can
## 1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 3 2.966208 2.328175 2.246748 7.538270 7.451179 3.175195 2.271758 0.000000
## 4 0.000000 0.000000 0.000000 3.769135 3.725589 0.000000 2.271758 1.946493
## 5 0.000000 2.328175 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
## 6 0.000000 2.328175 2.246748 0.000000 3.725589 3.175195 0.000000 1.946493
##      crowd    drink      kid    love      pay    price     work everythig
## 1 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000  0.000000
## 2 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000  0.000000
## 3 0.000000 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000  0.000000
## 4 2.635074 4.016633 2.046386 2.01212 4.303421 3.196505 3.761297  0.000000
## 5 2.635074 0.000000 2.046386 0.00000 0.000000 0.000000 0.000000  3.109975
## 6 2.635074 0.000000 0.000000 0.00000 0.000000 0.000000 0.000000  0.000000
##       took children  expense     fast  however     line managable    never
## 1 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000  0.000000 0.000000
## 2 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000  0.000000 0.000000
## 3 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000  0.000000 0.000000
## 4 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000  0.000000 0.000000
## 5 3.637314 0.000000 0.000000 0.000000 0.000000 0.000000  0.000000 0.000000
## 6 0.000000 2.936391 8.878398 2.634403 3.772577 2.259975  4.142058 3.361114
##     peopl    see     show     take   ticket      tri    water bad daughter know
## 1 0.00000 0.0000 0.000000 0.000000 0.000000 0.000000  0.00000   0        0    0
## 2 0.00000 0.0000 0.000000 0.000000 0.000000 0.000000  0.00000   0        0    0
## 3 0.00000 0.0000 0.000000 0.000000 0.000000 0.000000  0.00000   0        0    0
## 4 0.00000 0.0000 0.000000 0.000000 0.000000 0.000000  0.00000   0        0    0
## 5 0.00000 0.0000 0.000000 0.000000 0.000000 0.000000  0.00000   0        0    0
## 6 5.13035 4.4356 2.284691 2.534584 2.943857 3.447698 12.50124   0        0    0
##   though went best disappoint little magic plan restaurant servicable. think
## 1      0    0    0          0      0     0    0          0           0     0
## 2      0    0    0          0      0     0    0          0           0     0
## 3      0    0    0          0      0     0    0          0           0     0
## 4      0    0    0          0      0     0    0          0           0     0
## 5      0    0    0          0      0     0    0          0           0     0
## 6      0    0    0          0      0     0    0          0           0     0
##   week charactars. eat enough fantastci fun get money photo train want better
## 1    0           0   0      0         0   0   0     0     0     0    0      0
## 2    0           0   0      0         0   0   0     0     0     0    0      0
## 3    0           0   0      0         0   0   0     0     0     0    0      0
## 4    0           0   0      0         0   0   0     0     0     0    0      0
## 5    0           0   0      0         0   0   0     0     0     0    0      0
## 6    0           0   0      0         0   0   0     0     0     0    0      0
##   come holiday miss must save say start two florida still mania space spend
## 1    0       0    0    0    0   0     0   0       0     0     0     0     0
## 2    0       0    0    0    0   0     0   0       0     0     0     0     0
## 3    0       0    0    0    0   0     0   0       0     0     0     0     0
## 4    0       0    0    0    0   0     0   0       0     0     0     0     0
## 5    0       0    0    0    0   0     0   0       0     0     0     0     0
## 6    0       0    0    0    0   0     0   0       0     0     0     0     0
##   spent back earlier firework night young familiar made age help need look min
## 1     0    0       0        0     0     0        0    0   0    0    0    0   0
## 2     0    0       0        0     0     0        0    0   0    0    0    0   0
## 3     0    0       0        0     0     0        0    0   0    0    0    0   0
## 4     0    0       0        0     0     0        0    0   0    0    0    0   0
## 5     0    0       0        0     0     0        0    0   0    0    0    0   0
## 6     0    0       0        0     0     0        0    0   0    0    0    0   0
##   recommend wait pass half smaller definitaley book mickey ablaze california
## 1         0    0    0    0       0           0    0      0      0          0
## 2         0    0    0    0       0           0    0      0      0          0
## 3         0    0    0    0       0           0    0      0      0          0
## 4         0    0    0    0       0           0    0      0      0          0
## 5         0    0    0    0       0           0    0      0      0          0
## 6         0    0    0    0       0           0    0      0      0          0
##   comparable didnt first make next. nice sure year adult beauties buy new old
## 1          0     0     0    0     0    0    0    0     0        0   0   0   0
## 2          0     0     0    0     0    0    0    0     0        0   0   0   0
## 3          0     0     0    0     0    0    0    0     0        0   0   0   0
## 4          0     0     0    0     0    0    0    0     0        0   0   0   0
## 5          0     0     0    0     0    0    0    0     0        0   0   0   0
## 6          0     0     0    0     0    0    0    0     0        0   0   0   0
##   wonder land differ clean high trip end found hotel light bring everithing
## 1      0    0      0     0    0    0   0     0     0     0     0          0
## 2      0    0      0     0    0    0   0     0     0     0     0          0
## 3      0    0      0     0    0    0   0     0     0     0     0          0
## 4      0    0      0     0    0    0   0     0     0     0     0          0
## 5      0    0      0     0    0    0   0     0     0     0     0          0
## 6      0    0      0     0    0    0   0     0     0     0     0          0
##   although pirate thing use full right happier without alway long friend
## 1        0      0     0   0    0     0       0       0     0    0      0
## 2        0      0     0   0    0     0       0       0     0    0      0
## 3        0      0     0   0    0     0       0       0     0    0      0
## 4        0      0     0   0    0     0       0       0     0    0      0
## 5        0      0     0   0    0     0       0       0     0    0      0
## 6        0      0     0   0    0     0       0       0     0    0      0
##   pariah. part meet give watch return least anothe adventure cant
## 1       0    0    0    0     0      0     0      0         0    0
## 2       0    0    0    0     0      0     0      0         0    0
## 3       0    0    0    0     0      0     0      0         0    0
## 4       0    0    0    0     0      0     0      0         0    0
## 5       0    0    0    0     0      0     0      0         0    0
## 6       0    0    0    0     0      0     0      0         0    0

After adding the rating back, we split the term-frequency (TF) weighted document-term-matrix data frame into training and testing datasets: the training set contains 70% of the rows and the testing set contains the remaining 30%.

set.seed(617)
split = sample(1:nrow(disneyland_data),size = 0.7*nrow(disneyland_data))
train = disneyland_data[split,]
test = disneyland_data[-split,]

CART (TF features)

First, we fit a regression tree to predict rating from all other variables, that is, the term frequencies.

library(rpart)
#install.packages('rpart.plot')
library(rpart.plot)
tree = rpart(rating~.,train)
rpart.plot(tree)

We applied the predictions of the tree to the test sample to compute root mean square error (RMSE). The RMSE is 0.9900591.

pred_tree = predict(tree,newdata=test)
rmse_tree = sqrt(mean((pred_tree - test$rating)^2)); rmse_tree
## [1] 0.9900591
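
To put an RMSE value in context, it can help to compare it against a naive baseline that predicts the mean training rating for every test review. The sketch below is illustrative only: `rmse`, `train_rating`, and `test_rating` are hypothetical names with toy values, not objects or results from this report.

```r
# Helper: root mean square error between predictions and observed values.
rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))

# Hypothetical toy ratings standing in for train$rating and test$rating.
train_rating <- c(5, 4, 5, 3, 5, 4, 2, 5)
test_rating  <- c(5, 3, 4, 5)

# Baseline: predict the training mean for every test review.
baseline_pred <- rep(mean(train_rating), length(test_rating))
rmse_baseline <- rmse(baseline_pred, test_rating)
rmse_baseline
```

A model is only informative to the extent that its test RMSE falls below this kind of baseline, so the same helper could be applied to `pred_tree` and `test$rating`.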

Linear Regression (TF features)

Next, we fit a linear regression to predict rating from the same term-frequency variables.

reg = lm(rating~.,train)
summary(reg)
## 
## Call:
## lm(formula = rating ~ ., data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2217 -0.4488  0.2174  0.6066  4.8907 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.2467335  0.0089051 476.886  < 2e-16 ***
## busier        0.0347614  0.0150880   2.304 0.021235 *  
## day           0.0333278  0.0052558   6.341 2.32e-10 ***
## disneyland    0.0105037  0.0054022   1.944 0.051867 .  
## ever         -0.1002808  0.0213443  -4.698 2.64e-06 ***
## feel         -0.0340780  0.0156235  -2.181 0.029177 *  
## find         -0.0320150  0.0161639  -1.981 0.047641 *  
## hong          0.0433834  0.0858173   0.506 0.613189    
## kong         -0.0819234  0.0858294  -0.954 0.339843    
## main         -0.0207194  0.0174152  -1.190 0.234163    
## one          -0.0209127  0.0068667  -3.046 0.002325 ** 
## queue        -0.0689125  0.0076874  -8.964  < 2e-16 ***
## ride         -0.0349307  0.0042200  -8.277  < 2e-16 ***
## small        -0.0895776  0.0136658  -6.555 5.67e-11 ***
## street        0.0442219  0.0215938   2.048 0.040579 *  
## visit         0.0172332  0.0067959   2.536 0.011224 *  
## walk          0.0033093  0.0120249   0.275 0.783163    
## well          0.0489403  0.0112812   4.338 1.44e-05 ***
## world         0.0447760  0.0115730   3.869 0.000110 ***
## worth         0.1062335  0.0128101   8.293  < 2e-16 ***
## also          0.0154785  0.0098140   1.577 0.114766    
## area         -0.0019845  0.0145306  -0.137 0.891366    
## attract      -0.0419941  0.0093605  -4.486 7.28e-06 ***
## bit           0.0714168  0.0157229   4.542 5.59e-06 ***
## disney       -0.0440046  0.0050310  -8.747  < 2e-16 ***
## dont         -0.0429540  0.0111326  -3.858 0.000114 ***
## especial      0.0201458  0.0189079   1.065 0.286672    
## even         -0.0390860  0.0098418  -3.971 7.16e-05 ***
## expect       -0.0307576  0.0132084  -2.329 0.019885 *  
## experiance   -0.0390827  0.0102264  -3.822 0.000133 ***
## good         -0.0083701  0.0086481  -0.968 0.333127    
## got          -0.0206250  0.0110294  -1.870 0.061493 .  
## great         0.1237688  0.0078111  15.845  < 2e-16 ***
## just         -0.0214620  0.0077731  -2.761 0.005765 ** 
## last         -0.0977375  0.0163408  -5.981 2.24e-09 ***
## less         -0.0198917  0.0186200  -1.068 0.285396    
## like         -0.0154008  0.0088788  -1.735 0.082832 .  
## member       -0.0085260  0.0152672  -0.558 0.576538    
## mountain      0.0352843  0.0148378   2.378 0.017413 *  
## now          -0.0775180  0.0168848  -4.591 4.43e-06 ***
## open          0.0132875  0.0121574   1.093 0.274424    
## park         -0.0087809  0.0040322  -2.178 0.029435 *  
## place         0.0113032  0.0080716   1.400 0.161418    
## realli       -0.0023677  0.0084598  -0.280 0.779572    
## seem         -0.0298034  0.0147240  -2.024 0.042966 *  
## since         0.0146792  0.0184325   0.796 0.425821    
## somethin     -0.0032142  0.0196439  -0.164 0.870028    
## staff        -0.1624885  0.0110015 -14.770  < 2e-16 ***
## star          0.0273293  0.0177327   1.541 0.123285    
## stay          0.0502498  0.0129720   3.874 0.000107 ***
## theme        -0.0085686  0.0138255  -0.620 0.535413    
## time          0.0345154  0.0053564   6.444 1.18e-10 ***
## whole        -0.0313849  0.0175505  -1.788 0.073745 .  
## amazaaaaah..  0.2149915  0.0122545  17.544  < 2e-16 ***
## around        0.0130169  0.0108747   1.197 0.231321    
## arrival      -0.0106907  0.0188520  -0.567 0.570660    
## big           0.0444973  0.0143201   3.107 0.001890 ** 
## castle       -0.0311564  0.0178376  -1.747 0.080706 .  
## close        -0.1549234  0.0100884 -15.357  < 2e-16 ***
## enjoy         0.0671976  0.0095628   7.027 2.16e-12 ***
## everyone      0.1042187  0.0174031   5.989 2.14e-09 ***
## food         -0.0231880  0.0089052  -2.604 0.009222 ** 
## hour         -0.1605690  0.0102145 -15.720  < 2e-16 ***
## lot           0.0332516  0.0095216   3.492 0.000480 ***
## minut.       -0.0457196  0.0108264  -4.223 2.42e-05 ***
## much          0.0044375  0.0098809   0.449 0.653365    
## parad         0.0317380  0.0099334   3.195 0.001399 ** 
## quit         -0.0147260  0.0164663  -0.894 0.371165    
## shop         -0.0102745  0.0154741  -0.664 0.506710    
## way          -0.0673021  0.0131706  -5.110 3.24e-07 ***
## will          0.0091372  0.0077702   1.176 0.239631    
## can           0.0822794  0.0076619  10.739  < 2e-16 ***
## crowd        -0.0982473  0.0098666  -9.958  < 2e-16 ***
## drink        -0.0095182  0.0180349  -0.528 0.597666    
## kid          -0.0568329  0.0072239  -7.867 3.75e-15 ***
## love          0.1390220  0.0081052  17.152  < 2e-16 ***
## pay          -0.1766237  0.0199262  -8.864  < 2e-16 ***
## price        -0.1002167  0.0135440  -7.399 1.41e-13 ***
## work         -0.0584159  0.0167784  -3.482 0.000499 ***
## everythig     0.0836214  0.0137833   6.067 1.32e-09 ***
## took          0.0008739  0.0164736   0.053 0.957692    
## children     -0.0684988  0.0105164  -6.514 7.47e-11 ***
## expense      -0.0959482  0.0137374  -6.984 2.92e-12 ***
## fast         -0.0056744  0.0126493  -0.449 0.653728    
## however      -0.0384122  0.0152817  -2.514 0.011956 *  
## line         -0.0373275  0.0074568  -5.006 5.60e-07 ***
## managable    -0.0639526  0.0184304  -3.470 0.000521 ***
## never        -0.0587260  0.0150032  -3.914 9.09e-05 ***
## peopl        -0.1060211  0.0089751 -11.813  < 2e-16 ***
## see           0.0199983  0.0089320   2.239 0.025167 *  
## show          0.0247205  0.0081679   3.027 0.002476 ** 
## take          0.0310479  0.0101796   3.050 0.002291 ** 
## ticket       -0.0368143  0.0087390  -4.213 2.53e-05 ***
## tri          -0.0575286  0.0142676  -4.032 5.54e-05 ***
## water         0.0112922  0.0167072   0.676 0.499116    
## bad          -0.0870567  0.0200245  -4.348 1.38e-05 ***
## daughter     -0.0113881  0.0136518  -0.834 0.404185    
## know          0.0093184  0.0165136   0.564 0.572563    
## though        0.0868489  0.0156160   5.562 2.70e-08 ***
## went         -0.0127040  0.0094198  -1.349 0.177459    
## best          0.1357465  0.0130657  10.390  < 2e-16 ***
## disappoint   -0.3420133  0.0147622 -23.168  < 2e-16 ***
## little        0.0145083  0.0117925   1.230 0.218595    
## magic         0.0825491  0.0099256   8.317  < 2e-16 ***
## plan          0.0627160  0.0141147   4.443 8.89e-06 ***
## restaurant   -0.0075581  0.0134842  -0.561 0.575132    
## servicable.  -0.1359655  0.0172513  -7.881 3.35e-15 ***
## think        -0.0608032  0.0145617  -4.176 2.98e-05 ***
## week          0.0355589  0.0188379   1.888 0.059087 .  
## charactars.  -0.0214452  0.0104403  -2.054 0.039977 *  
## eat           0.0093626  0.0170596   0.549 0.583134    
## enough        0.0038871  0.0168709   0.230 0.817779    
## fantastci     0.2072158  0.0190010  10.906  < 2e-16 ***
## fun           0.0776914  0.0101443   7.659 1.94e-14 ***
## get           0.0033358  0.0060001   0.556 0.578249    
## money        -0.3892819  0.0169312 -22.992  < 2e-16 ***
## photo         0.0191894  0.0145526   1.319 0.187306    
## train         0.0431747  0.0131936   3.272 0.001068 ** 
## want         -0.0109111  0.0113792  -0.959 0.337638    
## better       -0.0400282  0.0136792  -2.926 0.003434 ** 
## come         -0.0068127  0.0137953  -0.494 0.621421    
## holiday       0.0251624  0.0163428   1.540 0.123653    
## miss          0.0505023  0.0164743   3.066 0.002175 ** 
## must          0.1231553  0.0157154   7.837 4.79e-15 ***
## save          0.0076759  0.0206132   0.372 0.709613    
## say          -0.0525152  0.0143651  -3.656 0.000257 ***
## start         0.0260744  0.0165219   1.578 0.114538    
## two          -0.0339374  0.0120251  -2.822 0.004773 ** 
## florida      -0.0884158  0.0147722  -5.985 2.19e-09 ***
## still         0.0329640  0.0114018   2.891 0.003842 ** 
## mania        -0.0883281  0.0106946  -8.259  < 2e-16 ***
## space         0.0015665  0.0201252   0.078 0.937958    
## spend        -0.0423408  0.0184742  -2.292 0.021919 *  
## spent         0.0019915  0.0183301   0.109 0.913484    
## back          0.0063244  0.0106343   0.595 0.552034    
## earlier       0.0948149  0.0137389   6.901 5.27e-12 ***
## firework      0.0390859  0.0124611   3.137 0.001711 ** 
## night         0.0392948  0.0138091   2.846 0.004436 ** 
## young         0.0201330  0.0189283   1.064 0.287499    
## familiar     -0.0138351  0.0111688  -1.239 0.215456    
## made          0.0020972  0.0167994   0.125 0.900654    
## age           0.1007268  0.0167341   6.019 1.77e-09 ***
## help          0.0790047  0.0153683   5.141 2.75e-07 ***
## need         -0.0388105  0.0121941  -3.183 0.001461 ** 
## look         -0.0682634  0.0146244  -4.668 3.06e-06 ***
## min          -0.0727035  0.0143345  -5.072 3.96e-07 ***
## recommend     0.0551549  0.0151730   3.635 0.000278 ***
## wait         -0.0262877  0.0080805  -3.253 0.001142 ** 
## pass          0.0153463  0.0103876   1.477 0.139589    
## half         -0.0995098  0.0195573  -5.088 3.64e-07 ***
## smaller       0.0579425  0.0209182   2.770 0.005610 ** 
## definitaley   0.0740396  0.0169621   4.365 1.28e-05 ***
## book         -0.0172191  0.0147060  -1.171 0.241655    
## mickey       -0.0221618  0.0148356  -1.494 0.135234    
## ablaze        0.1097057  0.0165470   6.630 3.42e-11 ***
## california   -0.0220489  0.0159785  -1.380 0.167624    
## comparable   -0.0504678  0.0212761  -2.372 0.017697 *  
## didnt        -0.0118424  0.0127607  -0.928 0.353399    
## first         0.0441951  0.0109528   4.035 5.47e-05 ***
## make          0.0158674  0.0118069   1.344 0.178991    
## next.        -0.0033712  0.0191650  -0.176 0.860372    
## nice         -0.0015046  0.0149861  -0.100 0.920028    
## sure          0.0452913  0.0155572   2.911 0.003602 ** 
## year          0.0044386  0.0097341   0.456 0.648403    
## adult         0.0304081  0.0154970   1.962 0.049751 *  
## beauties      0.1087420  0.0197987   5.492 4.00e-08 ***
## buy           0.0268342  0.0178634   1.502 0.133059    
## new           0.0027366  0.0153195   0.179 0.858225    
## old          -0.0192292  0.0133613  -1.439 0.150115    
## wonder        0.1532861  0.0174295   8.795  < 2e-16 ***
## land          0.0106474  0.0135621   0.785 0.432413    
## differ        0.0551390  0.0154944   3.559 0.000373 ***
## clean         0.1156974  0.0178121   6.495 8.42e-11 ***
## high          0.0243933  0.0195334   1.249 0.211750    
## trip          0.0232014  0.0123032   1.886 0.059332 .  
## end          -0.0016379  0.0164994  -0.099 0.920926    
## found         0.0003821  0.0165908   0.023 0.981627    
## hotel         0.0328390  0.0105309   3.118 0.001821 ** 
## light         0.0581660  0.0193320   3.009 0.002625 ** 
## bring         0.0486872  0.0176062   2.765 0.005690 ** 
## everithing    0.0969435  0.0128367   7.552 4.42e-14 ***
## although      0.0697053  0.0192175   3.627 0.000287 ***
## pirate       -0.0044723  0.0211620  -0.211 0.832625    
## thing        -0.0183291  0.0120063  -1.527 0.126867    
## use           0.0439529  0.0112816   3.896 9.80e-05 ***
## full         -0.0220858  0.0200784  -1.100 0.271352    
## right         0.1011022  0.0190383   5.310 1.10e-07 ***
## happier       0.0577608  0.0191657   3.014 0.002583 ** 
## without       0.0535065  0.0218150   2.453 0.014183 *  
## alway         0.1245232  0.0129597   9.608  < 2e-16 ***
## long         -0.0371159  0.0109593  -3.387 0.000708 ***
## friend        0.0978418  0.0152262   6.426 1.33e-10 ***
## pariah.      -0.0877762  0.0109471  -8.018 1.11e-15 ***
## part         -0.0658069  0.0204477  -3.218 0.001291 ** 
## meet          0.0491927  0.0158316   3.107 0.001890 ** 
## give         -0.0029705  0.0218030  -0.136 0.891631    
## watch         0.0214849  0.0175563   1.224 0.221049    
## return       -0.0427751  0.0190595  -2.244 0.024821 *  
## least        -0.0580124  0.0191180  -3.034 0.002412 ** 
## anothe       -0.0334258  0.0188120  -1.777 0.075607 .  
## adventure     0.0713293  0.0193814   3.680 0.000233 ***
## cant          0.0495708  0.0190164   2.607 0.009146 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.875 on 27828 degrees of freedom
## Multiple R-squared:  0.3199, Adjusted R-squared:  0.315 
## F-statistic: 65.12 on 201 and 27828 DF,  p-value: < 2.2e-16

We applied the predictions of linear regression to the test sample. The RMSE is 0.8705454.

pred_reg = predict(reg, newdata=test)
rmse_reg = sqrt(mean((pred_reg-test$rating)^2)); rmse_reg
## [1] 0.8705454

Next, we repeated the steps above on the dataset with TF-IDF weighting to predict rating.

set.seed(617)
split = sample(1:nrow(disneyland_data_tfidf),size = 0.7*nrow(disneyland_data_tfidf))
train = disneyland_data_tfidf[split,]
test = disneyland_data_tfidf[-split,]

CART (TF-IDF features)

library(rpart); library(rpart.plot)
tree1 = rpart(rating~.,train)
rpart.plot(tree1)

We applied the predictions of the tree to the test sample. The RMSE is 0.9900591.

pred_tree1 = predict(tree1,newdata=test)
rmse_tree1 = sqrt(mean((pred_tree1 - test$rating)^2)); rmse_tree1
## [1] 0.9900591

Linear Regression (TF-IDF features)

# the output is the same as the above linear regression, so the output is not included.
reg1 = lm(rating~.,train)
summary(reg1)

We applied the predictions of linear regression to the test sample. The RMSE is 0.8705454.

pred_reg1 = predict(reg1, newdata=test)
rmse_reg1 = sqrt(mean((pred_reg1 - test$rating)^2)); rmse_reg1
## [1] 0.8705454

The two tree models have the same test-sample RMSE (0.9900591), and so do the two linear regression models (0.8705454); the linear regression models have a lower RMSE than the tree models. Furthermore, the two tree models have almost the same splits, and the two linear regressions have almost the same coefficients. In the tree models, money, disappoint, and hour are highlighted as predictors of the rating. In the linear regression models, words related to waiting, such as “hour” and “queue”, have significant negative coefficients (p-value < 2e-16), and “time” and “minut.” are also significant (p-values of 1.18e-10 and 2.42e-05), which suggests that waiting time is highly valued by visitors. The time words plot and the decision trees both illustrate that a waiting time of about an hour is associated with a lower rating. “Food” is also a significant variable in the linear regression models, with p-values below 0.05 in both models. This indicates that food-related issues deserve attention from Disneyland theme parks.
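
One possible reason the TF and TF-IDF models give identical RMSEs is sketched below. It rests on an assumption not confirmed in this section: that the TF-IDF weighting multiplies each term column by a positive constant (an unnormalized IDF factor). Under that assumption a linear regression produces the same fitted values, and a regression tree produces the same partitions, because rescaling a column does not change the ordering of its values. The data, the `idf` weights, and all object names here are simulated for illustration.

```r
set.seed(617)
n <- 200
# Simulated term counts (toy data, not the review corpus).
tf <- data.frame(money = rpois(n, 0.3), great = rpois(n, 0.5))
rating <- 4 - 0.4 * tf$money + 0.2 * tf$great + rnorm(n, sd = 0.5)

# Assumed IDF-style weights: one positive constant per term column.
idf <- c(money = 2.1, great = 1.3)
tfidf <- as.data.frame(sweep(as.matrix(tf), 2, idf, `*`))

fit_tf    <- lm(rating ~ ., data = cbind(rating = rating, tf))
fit_tfidf <- lm(rating ~ ., data = cbind(rating = rating, tfidf))

# Fitted values agree; slopes differ only by the factor 1 / idf.
same_fit   <- all.equal(unname(fitted(fit_tf)), unname(fitted(fit_tfidf)))
same_slope <- all.equal(unname(coef(fit_tf)[-1]),
                        unname(coef(fit_tfidf)[-1] * idf))
c(same_fit, same_slope)
```

If the TF-IDF weights were instead normalized by document length, the columns would no longer be constant multiples of the TF columns and the two sets of models could differ.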

Conclusion

In conclusion, analyzing the key factors that visitors care about during their visits, and how their ratings of the different Disneyland branches reflect those factors, provides valuable insights into visitors’ preferences, expectations, and behaviors. As Disneyland continues to expand its global presence, it must consider these factors to ensure that each location provides unique and authentic experiences that meet visitors’ expectations. Doing so should increase visitor satisfaction and loyalty and contribute to the overall success of Disneyland’s global brand. Ultimately, this analysis emphasizes the importance of prioritizing visitor needs and preferences and provides targeted recommendations to help Disneyland achieve long-term success and maintain its status as a premier entertainment destination.

RQ1

First, we filtered out common words that are unnecessary for our study by examining the top 25 words mentioned in the review text. Then, we explored the frequency of the top words mentioned by reviewers from different continents. Reviewers from all continents mention some of the same words, such as “rides”, “time”, and “kids”, indicating that all reviewers care about these topics. Other words are mentioned at different frequencies across continents; for example, “food” is among the top 10 words for reviewers from Africa, Asia, Europe, and Oceania. For both the car adventure topic and the food-related topic, we conducted frequency analysis and sentiment analysis with the afinn lexicon for the targeted reviewer groups.

Reviewers from America and Canada mentioned car adventure topics slightly more often than reviewers from other areas, which might suggest that they tend to talk more about these topics. Moreover, the positive average sentiment score for America and Canada reviews suggests that these reviewers tend to be happy with their car adventure experiences.

Asian reviewers mentioned food-related topics less often than reviewers from other areas. This implies that Asian visitors may care less about food than visitors from non-Asian countries do when they visit the Disney theme parks. However, the positive average sentiment score for Asian reviews suggests that, even though Asian reviewers may not care about food as much as other reviewers do, they are overall satisfied with the food in the Disney theme parks they visited.

RQ2

In the Hong Kong branch, 22.23% of the reviews mentioned shopping topics. In California, only 12.81% did, which is much lower than in Hong Kong. Surprisingly, 28.27% of the reviews for the Paris branch mentioned shopping-related topics, which is higher than for the Hong Kong branch. We therefore reject our hypothesis that the Hong Kong branch received a similar proportion of shopping-related reviews. In the afinn sentiment analysis, we observed that in all three branches, reviews unrelated to the shopping experience contain more positive words and tend to have a more positive tone than reviews that discuss shopping.

Similar proportions of reviews mentioned ride experiences for the California and Paris branches, while the Hong Kong branch has the lowest proportion. In Hong Kong, the average rating is higher for reviews that mentioned rides than for those that did not. However, although the California branch tends to receive the highest ratings among the three branches, its average rating is higher for reviews that did not mention rides than for those that did. Paris tends to receive much lower average ratings than the other two branches, regardless of whether rides are mentioned. This might suggest that visitors are not as satisfied with the Paris branch in general as with the other branches, and that ride experiences might be related to its low ratings. As for the sentiment analysis for this hypothesis, the difference in the proportion of positive words between reviews with and without ride experiences is ambiguous, but the average sentiment score for ride-related reviews is much lower, indicating that reviewers may have more extreme emotions toward ride experiences.
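
The branch-level proportions discussed for RQ2 can be computed with a keyword match followed by a grouped mean. The sketch below is a minimal illustration on made-up reviews: the review texts, the `shopping_pattern` keyword list, and the California and Paris branch labels are assumptions for the example, not the exact inputs used in this report.

```r
# Toy reviews (hypothetical text, not taken from the dataset).
reviews <- data.frame(
  Branch = c("Disneyland_HongKong", "Disneyland_HongKong",
             "Disneyland_California", "Disneyland_California",
             "Disneyland_Paris", "Disneyland_Paris"),
  Review_Text = c("Great shops on Main Street", "Rides were fun",
                  "Long queue for rides", "Loved the parade",
                  "Bought souvenirs at the store", "Shopping was pricey"),
  stringsAsFactors = FALSE
)

# Assumed keyword pattern for shopping-related reviews.
shopping_pattern <- "shop|store|souvenir|merchandise"
reviews$mentions_shopping <- grepl(shopping_pattern, reviews$Review_Text,
                                   ignore.case = TRUE)

# Share of reviews per branch that mention shopping, in percent.
shopping_share <- 100 * tapply(reviews$mentions_shopping, reviews$Branch, mean)
shopping_share
```

Swapping in a ride-related pattern gives the corresponding ride proportions, and `tapply(reviews$Rating, ...)` on the flagged subsets would give the average ratings compared above, assuming a `Rating` column is present.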

RQ3

We classified topic words into four categories (time, theme rides, dining, and customer service) and examined whether reviews mentioning these topics affect the overall rating. From our analysis, we conclude the following. For reviews mentioning food, we reject the null hypothesis: in low-rating reviews, we observed many reviews discussing food quality. For time words, we reject the null hypothesis: specifically, when a review mentions a one-hour wait time, the overall rating tends to be lower. For most ride features, we fail to reject the null hypothesis, meaning that in reviews mentioning specific rides, the features associated with the ride do not affect the review’s overall rating. However, we do find a special case in the rides’ drops. Lastly, for reviews mentioning staff, we reject the null hypothesis: in lower rating categories, the proportion of reviews that mention staff is larger than in higher rating categories.
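
One way to check a statement such as "the proportion of reviews mentioning staff differs between rating categories" is a chi-squared test of independence. This is offered as an assumed approach for illustration; this section does not show which test the analysis actually used, and the counts in `staff_table` are hypothetical.

```r
# Hypothetical counts of reviews by rating group and staff mention.
staff_table <- matrix(
  c(120, 380,   # low ratings: mentions staff, does not
     90, 910),  # high ratings: mentions staff, does not
  nrow = 2, byrow = TRUE,
  dimnames = list(rating_group = c("low", "high"),
                  mentions_staff = c("yes", "no"))
)

# Proportion of reviews mentioning staff within each rating group.
staff_share <- prop.table(staff_table, margin = 1)[, "yes"]

# Chi-squared test of independence between rating group and staff mention.
test_result <- chisq.test(staff_table)
test_result$p.value < 0.05
```

A small p-value would lead to rejecting the null hypothesis that mentioning staff is unrelated to the rating group; it does not by itself show that staff issues cause lower ratings.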

Recommendation

Based on the results of the research questions above, our recommendations for Disneyland fall into two groups: opening new branches to attract more visitors, and improving existing branches to increase overall satisfaction and experience.

For opening new parks:
  1. Consider the interests and preferences of visitors from different continents when designing the park experience. Visitors from different continents may have different priorities, as reflected in how often certain topics appear in their reviews. For example, visitors from America and Canada tend to mention car adventure topics more frequently, so it may be worthwhile to design or add more car adventure-themed rides in new branches in the Americas. As for Asia, while visitors from this region may not mention food as frequently as visitors from non-Asian countries, it is still important to provide a variety of food options that cater to different tastes and preferences. It may also be beneficial to design or add more cultural-themed rides or attractions that showcase the unique heritage and traditions of the region.
  2. Improve branches’ support facilities to cater to the needs of visitors from different countries and continents. For example, in the Paris branch, 28.27% of reviews mentioned shopping-related topics, which is higher than in the Hong Kong and California branches. By adding more Disney co-branded stores in France, Disney could collaborate with French fashion brands or designers to create exclusive products that combine French fashion with Disney’s popular characters and stories. This could appeal to French visitors who are interested in fashion and design and add a unique touch to the merchandise offerings.

For existing parks:

As for improving existing parks, recommendations are provided based on four sections: food, time, ride, and staff.
Food:
1. Disneyland should address the issues of overpricing, limited and unclean dining options, poor-quality or unhealthy food, and long waiting times. These problems are highlighted in many negative reviews and are supported by the predictive models. To improve, Disneyland can consider building more restaurants and offering online ordering to reduce waiting times.
2. Disneyland could consider expanding and diversifying its food options, particularly in non-Asian parks where food is mentioned more frequently. It could also focus on improving the quality and taste of its current food options to increase overall visitor satisfaction.
Time:
1. Visitors have expressed dissatisfaction with waiting times of more than an hour, which has led to lower ratings. Disneyland can try to reduce waiting times by improving queue management, offering fast pass or similar systems, or increasing the number of staff during peak periods.
Ride:
1. Most ride-related factors were found to have no significant impact on overall ratings, except for the number of drops. High-rated reviews mentioned more non-drop attractions. Therefore, Disneyland could consider adding non-drop attractions to improve overall visitor satisfaction.
2. Disneyland could focus on expanding and diversifying its ride offerings to attract a wider range of visitors, taking into account the preferences of visitors from different countries. It could also prioritize the maintenance and upkeep of its current rides to ensure a high-quality experience for visitors.
Staff:
1. Visitors have complained about employees not doing their jobs, having a poor attitude, and, in the case of Hong Kong Disneyland, not speaking English. Disneyland should address these issues by training employees in language skills, customer service, and giving directions to improve the overall visitor experience.

Limitation

First, our dataset contains only “review_id”, which is unique for each review; a “reviewer_id” identifying each unique user is not included. As a result, we were not able to examine the reviews that one person wrote for different Disneyland park locations, and we could not rule out the possibility of one reviewer leaving multiple reviews for the same location. Second, our analysis covers only reviews of the Disneyland parks in Paris, Hong Kong, and California; Disneyland parks in other cities and countries are not included, so the generalizability of our study may be limited. Third, there may be errors in matching ride names with review texts: some Disney character names may have been falsely matched with a ride name, and our study could not identify reviews with spelling errors and match them with the correct ride names. Therefore, our analysis may contain a small portion of inaccurate data and may miss some relevant data. Fourth, our study did not rule out the impact of visiting time, seasonality, and year on reviews and ratings; since Disneyland parks have peak and off seasons, the time of visit is a confounding variable in our study. Finally, the text analysis cannot detect sentiments such as irony, so the sentiment analysis is limited in the variety of sentiments it captures.

Future Study

Based on the sentiment analysis conducted on the Disneyland review dataset, we make several suggestions for future studies. First, future studies could investigate more topics that may affect visitor ratings. For instance, topics such as kids’ and children’s experiences could reveal how Disneyland parks create childhood memories and shape family relationships. Similarly, the interaction between visitors and animated characters could shed light on the immersive experience. Fireworks are another core component of the Disneyland experience, so further analysis could focus on this element and explore Disneyland’s nighttime experience. Second, the tree models used in this study identified “money” as an important variable; hence, future studies could further investigate the financial factors that affect visitors’ perception of the park. Such an investigation could focus on the costs of tickets, food, and merchandise, as well as expenses outside the park such as nearby hotels and transportation. It could also examine how visitors feel about these prices and whether they feel the expenses are worth it. Third, this study did not take the time of visit into account. Follow-up studies could divide the data into off-season and peak season and compare visitor ratings between the two periods to explore the impact of crowds and wait times on visitor experience, which would make the research more comprehensive. Finally, the data come from only three Disneyland branches. Including data from other branches or Disney parks worldwide could provide a more diverse and comprehensive understanding of the factors that affect visitor ratings. It could also help identify more regional differences in visitor perceptions of Disney parks and provide valuable insights into how the parks can better cater to the needs and preferences of visitors from different parts of the world.

References

Luo, J., Li, G., Li, G., & Law, R. (2020). Topic modelling for theme park online reviews: analysis of Disneyland. Journal of Travel & Tourism Marketing, 37(2), 272–285. https://doi.org/10.1080/10548408.2020.1740138

Disneyland Reviews. (2021, January 19). Kaggle. https://www.kaggle.com/datasets/arushchillar/disneyland-reviews

Walt Disney World Ride Data - dataset by lynne588. (2023, March 15). Data.world. https://data.world/lynne588/walt-disney-world-ride-data